All right everybody, let's get started. Today's paper is the Aurora paper, which is all about how to get a high-performance, reliable database going as a piece of cloud infrastructure, itself built out of infrastructure that Amazon makes available. The reason we're reading this paper is, first of all, that it's a very successful recent cloud service from Amazon; a lot of their customers use it. It also shows, in its own way, an example of a very big payoff from clever design: Table 1, which summarizes the performance, claims a thirty-five times speedup in transaction throughput relative to some other system that is not very well explained, which is extremely impressive. The paper also explores the limits of how well you can do for performance and fault tolerance using general-purpose storage, because one of the themes of the paper is that they basically abandoned general-purpose storage: they switched away from a design that used Amazon's own general-purpose storage infrastructure, decided it was not good enough, and built totally application-specific storage instead. Furthermore, the paper has a lot of little tidbits about what turned out to be important in the cloud-infrastructure world.

Before talking about Aurora I want to spend a bit of time going over the back history, or at least my impression of the story that led up to Aurora's design, because it reflects the way Amazon has in mind that its cloud customers ought to build databases on Amazon's infrastructure. In the beginning, Amazon's very first cloud offering, to support people who wanted to build websites using Amazon's hardware in Amazon's machine rooms, was something called EC2, for Elastic Compute Cloud. The idea was that Amazon had big machine rooms full of servers, ran virtual machine monitors on those servers, and rented out virtual machines to customers, who would rent a bunch of virtual machines and run web servers, databases, and whatever else they needed inside these EC2 instances. So the picture of one physical server looked like this: Amazon controls the virtual machine monitor on the hardware server, and then there's a bunch of guests, a bunch of EC2 instances, each one rented to a different cloud customer, each running a standard operating system like Linux plus a web server or maybe a database server. These were relatively cheap and relatively easy to set up, and it was a very successful service. One little detail that's extremely important for us is how you got storage initially: every one of their servers had a physical disk attached, and each instance rented to a customer got a slice of that local disk.
So this was locally attached storage: you got a bit of local disk that looked to the virtual machine guest just like a hard drive, an emulated hard drive. EC2 was just about perfect for stateless web servers: your customers with their web browsers would connect to a bunch of rented EC2 instances running a web server, and if you suddenly had more customers you could instantly rent more EC2 instances from Amazon and fire up web servers on them — an easy way to scale up your ability to handle web load. So it was good for web servers.

The other main thing people ran in EC2 instances was databases, because usually a website is constructed as a set of stateless web servers that, any time they need to get at permanent data, talk to a back-end database. So what you would have is a bunch of client browsers in the outside world, outside Amazon's infrastructure; then a number of EC2 web-server instances, as many as you need to run the logic of the website, inside Amazon; and then also, typically, one EC2 instance running a database. Your web servers would talk to your database instance and ask it to read and write records in the database.

Unfortunately, EC2 wasn't nearly as well suited to running a database as it was to running web servers, and the most immediate reason is that the main easy way to get storage for your EC2 database instance was the locally attached disk of whatever piece of hardware your database instance happened to be running on. If that hardware crashed, you also lost access to whatever was on its hard drive. If the hardware running a web server crashed, no problem at all, because a web server keeps essentially no state itself — you just fire up a new web server on a new EC2 instance. But if the hardware running your database instance crashes or becomes unavailable, you have a serious problem if the data is stored on the locally attached disk. Initially there wasn't a lot of help for this. One thing that did work out well is that Amazon provided a scheme for storing large chunks of data called S3, and you could take periodic snapshots of your database state, store them in S3, and use that for backup and disaster recovery. But that style of periodic snapshots means you're going to lose the updates that happened between the periodic backups.

So the next thing that came along that's relevant to the Aurora story is that, in order to provide customers with disks for their EC2 instances that didn't go away if there was a failure — more fault-tolerant, long-term storage — Amazon introduced a service called EBS, which stands for Elastic Block Store.
EBS is a service that looks, to an EC2 instance — to one of these guest virtual machines — just as if it were an ordinary hard drive. You can format it with a file system like ext3 or whatever Linux file system you like, on this thing that looks to the guest just like a hard drive. But the way it's actually implemented is as a replicated pair of storage servers. So instead of local storage, you could now rent an EBS volume, which looks just like an ordinary hard drive but is actually implemented as a pair of EBS servers, each with an attached hard drive, using the chain replication we talked about last week. Say you're running a database now, and your database mounts one of these EBS volumes as its storage: when the database server does a write, what that actually means is that the write is sent out over the network and, with chain replication, is first written to the first EBS server backing your volume, then to the second one, and only then do you get the reply; and similarly, when you do a read, with chain replication you read from the last server in the chain.

So now a database running on an EC2 instance had available a storage system that would actually survive the crash — the death — of the hardware it was running on. If the physical server died, you could just get another EC2 instance, fire up your database on it, and have it attach to the same old EBS volume that the previous incarnation of your database was attached to, and it would see all the old data just as the previous database had left it, as if you'd moved a hard drive from one machine to another. So EBS was really a good deal for people who needed to keep permanent state, like people running databases.

One thing that's important for us about EBS is that it's not a system for sharing: at any one time, only one EC2 instance, only one virtual machine, can mount a given EBS volume. The EBS volumes are implemented on a huge fleet of hundreds or more storage servers with disks at Amazon, and everybody's EBS volumes are stored on this big pool of servers, but each EBS volume can be used by only one EC2 instance, only one customer, at a time.

Still, EBS was a big step up, but it still had some problems. One is that if you run a database on EBS, it ends up sending large volumes of data across the network — and here we're starting to sneak up on Figure 2 in the paper, where they start complaining about just how many writes it takes if you run a database on top of a network storage system. So a database on EBS generated a lot of network traffic, and one thing the paper implies is that they're as much network-limited as they are CPU- or storage-limited.
That is, the Aurora paper pays a huge amount of attention to reducing the network traffic the database generates, and seems to worry much less about how much CPU time or disk space is being consumed; that's a hint at what they think is important. The other problem is that EBS is not very fault-tolerant. It turns out that, for performance reasons, Amazon would always put both replicas of your EBS volume in the same data center. So if a single server crashed — if one of the two EBS servers you're using crashed — that's okay, because you switch to the other one; but there was just no story at all for what happens if an entire data center went down. And apparently a lot of customers really wanted a story that would allow their data to survive an outage of an entire data center — maybe it lost its network connection, or there was a fire in the building, or a power failure to the whole building. People really wanted at least the option, if they're willing to pay more, of having their data stored in a way that they could still get at it even if one data center goes down. The way Amazon describes this is that both an instance and its two EBS replicas are in the same availability zone. In Amazon jargon, an availability zone is a particular data center, and the way they structure their data centers is that there are usually multiple independent data centers in more or less the same city, or relatively close to each other, and the two or three nearby availability zones are all connected by redundant high-speed networks. So there are always pairs or triples of nearby availability zones, and we'll see why that's important in a little bit. But at least for EBS, in order to keep the cost of chain replication down, they required the two replicas to be in the same availability zone.

All right. Before I dive further into how Aurora actually works: to understand the details of the design, we first have to know a fair amount about the design of typical databases, because what they've done is take the main machinery of a database — MySQL, as it happens — and split it up in an interesting way, so we need to know what a database does in order to understand how they split it up. So this is really a kind of database tutorial, focusing on what it takes to implement transactions — crash-recoverable transactions. What I really care about is transactions and crash recovery; there's a lot else going on in databases, but this is the part that matters for this paper.

First, what's a transaction? A transaction is just a way of wrapping multiple operations on maybe different pieces of data and declaring that that entire sequence of operations should appear atomic to anyone else who's reading or writing the data. So, for example, maybe we're running a bank and we want to do transfers between different accounts.
The code for such a transaction might look like this: you declare the beginning of the sequence of statements that you want to be atomic — the transaction — and maybe we're going to transfer money from account y to account x. Pretending x is a bank balance stored in the database, the transaction might look like: add ten dollars to x's account, deduct the same ten dollars from y's account, and that's the end of the transaction. I want the database to do both of these without allowing anybody else to sneak in and see the state between the two statements. And, with respect to crashes, if there's a crash somewhere in the middle, we want to make sure that after the crash and recovery either the entire transaction's worth of modifications is visible, or none of it is. That's the effect we want from transactions. Additionally, database users expect that the database will tell the client that submitted the transaction whether the transaction really finished and committed or not, and if a transaction has committed, clients expect it to be permanent — durable — still there even if the database should crash and reboot.

One thing that's a bit important is that the usual way these are implemented is that the transaction locks each piece of data before it uses it. So you can view there as being locks on x and y for the duration of the transaction, and they're released only after the transaction finally commits and is known to be permanent. This matters because some of the details in the paper only make sense if you realize that the database is actually locking out other access to the data during the life of a transaction.
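Just to make the running example concrete, here's a toy sketch — a made-up, in-memory model, nothing like real MySQL internals — of the transfer with the locking behavior I just described:

    package main

    import (
        "fmt"
        "sync"
    )

    // Toy model: x and y are bank balances; the mutex stands in for the
    // per-record locks the database holds for the life of the transaction.
    var (
        mu      sync.Mutex
        balance = map[string]int{"x": 500, "y": 750}
    )

    func transfer() {
        mu.Lock()         // acquire locks before touching x and y
        defer mu.Unlock() // released only after the transaction is durable
        balance["x"] += 10
        balance["y"] -= 10
        // a real database would force the write-ahead log to disk here,
        // and only then report "committed" and release the locks
    }

    func main() {
        transfer()
        fmt.Println(balance["x"], balance["y"]) // 510 740
    }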
So how is this actually implemented? The databases we're considering are typically written to run on a single server with some storage directly attached, and the game the Aurora paper is playing is to move that software, only modestly revised, onto a much more complex networked system. But the starting point is a database with a disk attached. The on-disk structure that stores the records is some kind of indexing structure, like a B-tree: there are pages — what the paper calls data pages — that hold the real data of the database; maybe this page holds x's balance and that one holds y's. These data pages typically hold lots and lots of records, whereas x and y are just a couple of bytes on some page. So on the disk there's the actual data, plus there's also a write-ahead log, or WAL, and the write-ahead log is a critical part of why the system is going to be fault-tolerant. Inside the database server there's the database software, and the database typically has a cache of pages that it has recently read from the disk.

When you execute a transaction — when you actually execute these statements — what x = x + 10 turns into at runtime is that the database reads the current page holding x from the disk and adds 10 to it; but, until the transaction commits, it only makes the modification in the local cache, not on the disk, because we don't want to write to the disk yet and possibly expose a partial transaction.

Then, because the database wants the complete transaction to be available to the software after a crash, during recovery, before the database is allowed to modify the real data pages on disk it is first required to add log entries describing the transaction: before it can commit the transaction, it needs to put a complete set of entries in the write-ahead log on disk describing all of the database's modifications. So let's suppose x starts out as, say, 500 and y starts out as 750, and we want to execute this transaction. Before committing, and before writing the pages, the database is going to add typically three log records. One says: as part of this transaction, I'm modifying x, its old value is 500, and here's the new value, 510 — so each log entry names the item being modified, the old value, and the new value. Another for y: old value 750, we're subtracting 10, so the new value is 740. And then, if the database actually manages to get to the end of the transaction before crashing, it writes a commit record. Typically these are all tagged with a transaction ID so that the recovery software will eventually know which commit record refers to which log records.

Yes? [A student asks why the old values are needed.] In a simple database it would be enough to just store the new values and say: if there's a crash, we'll just reapply all the new values. The reason most serious databases store the old value as well as the new value is to give themselves freedom, even for a long-running transaction, to write an updated page to disk before the transaction has finished — with the new value, 740 say, from an uncompleted transaction — as long as the log record is already on disk. Then, if there's a crash before the commit, the recovery software can say: aha, this transaction never finished, therefore we have to undo all of its changes, and these old values are exactly what you need to undo a transaction that's been partially written to the data pages. Aurora indeed uses undo/redo logging to be able to undo partially applied transactions.

Okay, so if the database manages to get the transaction's log records onto the disk, plus the commit record marking it finished, then it is entitled to tell the client that the transaction committed.
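Here's a rough picture — invented field names, not InnoDB's actual log format — of the three log records for this transaction, and the rule for when the database may acknowledge the commit:

    package main

    import "fmt"

    // One write-ahead-log entry. Update records carry both the old value
    // (needed to undo an uncommitted transaction) and the new value
    // (needed to redo a committed one).
    type LogRecord struct {
        TxnID int
        Kind  string // "update" or "commit"
        Key   string
        Old   int
        New   int
    }

    func main() {
        wal := []LogRecord{
            {TxnID: 7, Kind: "update", Key: "x", Old: 500, New: 510},
            {TxnID: 7, Kind: "update", Key: "y", Old: 750, New: 740},
            {TxnID: 7, Kind: "commit"},
        }
        // Only once all three records (including the commit record) are
        // safely on disk may the database tell the client "committed".
        fmt.Println(len(wal), "log records for transaction 7")
    }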
The database can reply to the client, and the client can be assured that its transaction will be visible forever. Now one of two things happens. If the database server doesn't crash, then — it has modified these x and y records in its cache to be 510 and 740 — eventually the database will write its cached, updated blocks to their real places on the disk, overwriting those B-tree nodes or whatever, and then the database can reuse that part of the log. Databases tend to be lazy about this, because there may be many updates to those cached pages and it's nice to accumulate a lot of updates before being forced to write the disk. If the database server crashes before writing those pages to the disk — so they still have their old values — then it's guaranteed that the recovery software, when the database restarts, will scan the log, see the records for the transaction, see that the transaction was committed, and apply the new values to the stored data. That's called redo: it basically redoes all the writes in the transaction. So that's how transactional databases work, in a nutshell, and this is an extremely abbreviated version of how, for example, the MySQL database works. Aurora is based on this open-source database software called MySQL, which does transactions and crash recovery in much this way.

Okay, so the next step in Amazon's development of better and better database infrastructure for its cloud customers is something called RDS, and I'm only talking about RDS because it turns out that, even though the paper doesn't quite mention it, Figure 2 in the paper is basically a description of RDS. What's going on in RDS is that it was a first attempt to get a database that was replicated in multiple availability zones, so that if an entire data center went down, you could get back your database contents without missing any writes. The deal with RDS is that there's one EC2 instance that's the database server — just one, running one database — and it stores its data pages and log not on the local disk but in EBS. So whenever the database does a log write or a page write, those writes actually go to the two EBS replicas, all in one availability zone. In addition, for every write the database software does, Amazon would transparently — without the database necessarily even realizing it was happening — also send those writes to a setup in a second availability zone, a second machine room: judging from Figure 2, a separate computer or EC2 instance whose job was just to mirror the writes that the main database did. That mirroring server would then copy those writes to a second pair of EBS servers.
So with this RDS setup — and that's what Figure 2 shows — every time the database appends to the log or writes one of its pages, the data has to be sent to the two local EBS replicas, and has to be sent over the network connection across to the other availability zone on the other side of town, to the mirroring server, which then sends it to its two separate EBS replicas; and then finally the reply comes back, and only then is the write finished — only then can the database say, aha, my write finished, I can count this log record as really having been appended to the log. This RDS arrangement gets you better fault tolerance, because now you have a complete, up-to-date copy of the database — seeing all of the very latest writes — in a separate availability zone. Even if a fire burns down this entire data center, boom, you can run the database in a new instance in the second availability zone and lose no data at all.

Yes? [A student asks why EBS itself doesn't replicate across availability zones.] I don't know how to answer that; that's just not what they do. My guess is that for most EBS customers it would be too painfully slow to forward every write across two separate data centers. I'm not really sure what's going on, but I think the main answer is that they don't do that, and this RDS scheme is a bit of a workaround for the way EBS works — a way to get cross-zone replication while using the existing EBS infrastructure unchanged.

Anyway, this turns out to be extremely expensive, or at least as expensive as you might think, because fairly large volumes of data get written. Even this transaction, which seems like it just modifies two integers — maybe eight bytes, or sixteen, only a few bytes of data are being modified — translates, as far as the database reading and writing the disk is concerned, into this: the log records are actually quite small, these two log records might themselves be only dozens of bytes long, so that's nice; but the reads and writes of the actual data pages are likely to be much, much larger than a couple dozen bytes, because each page is eight kilobytes or sixteen kilobytes or some relatively large number — the file-system or disk block size. That means that just to update these two numbers, when it comes time to update the data pages, there's a lot of data being pushed around. On a locally attached disk that's reasonably fast, but I guess what they found is that when they started sending those big eight-kilobyte writes across the network, it used up too much network capacity, and so this Figure 2 arrangement was evidently too slow.

Yes? [A student asks what the database has to wait for.] In this Figure 2 setup, unknown to the database server, every time it wrote to its EBS disk, a copy of every write went across availability zones and had to be written to both of those remote EBS servers and acknowledged; only then did the write appear to complete to the database. So it really had to wait for all four copies to be updated, and for the data to be sent on the link across to the other availability zone.
As far as Table 1 is concerned — that first performance table — the reason the mirrored-MySQL line is much, much slower than the Aurora line is basically that it sends huge amounts of data over these relatively slow network links, and that was the performance problem they were really trying to fix. So this setup is good for fault tolerance, because now we have a second copy in another availability zone, but it was bad news for performance.

All right, the next step after this is Aurora. The high-level view is that we still have a database server, although now it's running custom software that Amazon supplies. So I can rent an Aurora server from Amazon, but I'm not running my software on it; I'm renting a server running Amazon's Aurora database software. It's just one instance, and it sits in some availability zone. There are two interesting things about the way it's set up. First of all, the storage — its replacement, basically, for EBS — now involves six replicas, two in each of three availability zones, for super fault tolerance. And every time the database writes — it's more complicated than this, and we're not sure exactly how it's managed, but more or less — the writes have to get sent, one way or another, to all six of these replicas.

So this looks like more replicas. Why isn't it slower than the previous scheme, which only had four replicas? The answer is that the only thing being written over the network is the log records. That's really the key to the success: the data that goes over these links to the replicas is just the log records, the log entries. And a log entry — at least in this simple example; real ones aren't quite this small — is really not vastly more than the couple of dozen bytes needed to store the old value and the new value for the piece of data we're writing. So log entries tend to be quite small, whereas when the database thought it was writing a local disk and updating its data pages, those writes were enormous — the paper doesn't really say, I think, but eight kilobytes or more. So the RDS setup was sending, for each transaction, multiple eight-kilobyte pages across to the replicas, whereas this setup sends small log entries to more replicas — and the log entries are so very much smaller than 8K pages that it's a net performance win. So that's one of their big insights: send just the log entries over the network.
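As a rough back-of-the-envelope comparison — my numbers, not the paper's; assume 8 KB data pages, log records of around 100 bytes, and the simplified picture from the board (two dirtied pages plus three log records per transaction):

    package main

    import "fmt"

    func main() {
        // Guessed sizes, just to show the shape of the argument.
        const pageBytes = 8192 // one data page
        const logBytes = 100   // one log record

        // Mirrored-MySQL style: the dirtied pages plus the log records,
        // each pushed to 4 EBS replicas across two availability zones.
        mirrored := (2*pageBytes + 3*logBytes) * 4

        // Aurora style: only the 3 log records, sent to 6 replicas.
        aurora := 3 * logBytes * 6

        fmt.Println("mirrored ≈", mirrored, "bytes; aurora ≈", aurora, "bytes")
        // mirrored ≈ 66736 bytes, aurora ≈ 1800 bytes with these made-up sizes
    }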
Of course, a piece of fallout from this is that their storage system is now not very general-purpose: it's a storage system that understands what to do with MySQL log entries. EBS was a very general-purpose emulated disk — you read and write blocks, and it doesn't understand anything about anything except blocks — whereas this is a storage system that really understands that it's sitting underneath a database. So that's one thing they've done: ditched general-purpose storage and switched to a very application-specific storage system.

The other big thing, which I'll also go into in more detail, is that they don't require the writes to be acknowledged by all six replicas in order for the database server to continue. Instead, the database server can continue as long as a quorum — which turns out to be four — as long as any four of these servers respond. So if one availability zone is offline, or maybe the network connection to it is slow, or maybe some servers just happen to be slow doing something else at the moment we're trying to write, the database server can basically ignore the two slowest, or the two most dead, of the servers when it's doing a write. It only requires acknowledgments from any four out of six, and then it can continue. This quorum scheme is the other big trick they use, to let them have more replicas in more availability zones and yet not pay a huge performance penalty, because they never have to wait for all of them — just the four fastest of the six replicas.

So the rest of the lecture is going to explain first quorums and then this idea of sending just log entries. Table 1 basically summarizes the result: by switching from the architecture in which they send the big data pages to four places, to this Aurora scheme of sending just log entries to six replicas, they get an amazing thirty-five-times performance increase over that other system. The paper is not very good about explaining how much of the performance is due to quorums and how much is due to just sending log entries, but any way you slice it, a thirty-five-times improvement in performance is very respectable, and of course extremely valuable to their customers and to them — transformative, I'm sure, for many of Amazon's customers.

All right. The first thing I want to talk about in detail is their quorum arrangement — what they actually mean by quorums. The quorums are all about the arrangement of this fault-tolerant storage, so it's worth thinking a little bit about what their fault-tolerance goals were.
The goals: they wanted to be able to do reads and writes even if one availability zone was completely dead, and to be able to read even if there was one dead availability zone plus one other dead server. The reason for this is that an availability zone might be offline for quite a while — maybe it suffered a flood or something — and while it's down for a couple of days or a week while people repair the damage, we're now reliant on just the servers in the other two availability zones; if one of them should go down, we don't want that to be a disaster. So: write even with one dead availability zone, and furthermore read — still read and get the correct data — even with one dead availability zone plus one other dead server in the live availability zones. We have to take it for granted that they know their own business and that this really is the sweet spot for how fault-tolerant you want to be. In addition, as I already mentioned, they want to be able to ride out temporarily slow replicas. I think it's clear from a lot of sources that if you read and write EBS, for example, you don't get consistently high performance all the time; sometimes there are little glitches, because maybe some part of the network is overloaded or something is doing a software upgrade, and it's temporarily slow. So they want to be able to just keep going despite transiently slow, or briefly unavailable, storage servers.

A final requirement is that if a storage server should fail, it's a bit of a race against time before the next storage server fails. That's sort of always the case, and the statistics are not as favorable as you might hope, basically because server failures are often not independent: the fact that one server is down often means there's a much-increased probability that another of your servers will soon go down, because it's identical hardware, maybe bought from the same company, coming off the same production line one after another, and a flaw in one of them is extremely likely to be reflected in a flaw in another. So people are always nervous: if there's one failure, there could be a second failure very soon. And in these quorum systems — it's a little bit like Raft — you can only recover as long as not too many of the replicas fail. So they really needed fast re-replication: if one server seems permanently dead, we'd like to be able to generate a new replica as fast as possible from the remaining replicas.

Those are the main fault-tolerance goals the paper lays out. By the way, this discussion is only about the storage servers — their failure characteristics and how to recover. It's a completely separate topic what to do if the database server fails; Aurora has a totally different set of machinery for noticing that a database server has failed and creating a new database server running on a new instance, and that's not what I'm talking about right now — we'll get to it a little later. Right now the focus is just on the storage system.
Okay, so to make the storage fault-tolerant they use this idea called quorums. For a little while now I'm going to describe the classic quorum idea, which dates back to the late seventies — quorum replication. I'll describe the abstract quorum idea; they use a variant of what I'm going to explain. The idea behind quorum systems is to build storage that provides fault tolerance using replication, and to guarantee that even if some of the replicas fail, reads will still see the most recent writes. Typically quorum systems are simple read/write systems — put/get systems; they don't usually directly support more complex operations: you have objects, and you can read an object or overwrite an entire object.

The idea is that you have N replicas. In order to write, you have to make sure your write is acknowledged by W of the replicas, where W is at most N. And if you want to do a read, you have to get read information from at least R replicas. The key thing is that W and R have to be set relative to N so that any quorum of W servers you manage to send a write to must necessarily overlap with any quorum of R servers that any future reader might read from. What that means is that R plus W has to be greater than N, so that any W servers must overlap, in at least one server, with any R servers.

So imagine three servers, S1, S2, S3, and say we just have one object that we're updating. We send out a write; maybe we want to set the value of our object to 23. In order to do a write we need to get our new value onto at least W of the replicas; let's say for this system that R and W are both 2 and N is 3. To do a write, we need to get our new value onto a write quorum of two servers — maybe we get our write onto S1 and S2, so they both now know that the value of our data object is 23. If somebody comes along and reads, the read also requires that the reader check with at least a read quorum of the servers, which is also two in this setup. That read quorum could include a server that didn't see the write, but it has to include at least one that did, in order to get to two. So any future read must, for example, consult both the server that didn't see the write and at least one that did: the requirement that read and write quorums overlap in at least one server means any read must consult at least one server that saw any previous write.

Now, what's cool about this — well, actually, there's still one critical piece missing. The reader is going to get back R results, possibly R different results, and the question is how the reader knows which of the R results it got back is the correct one.
Something that doesn't work is voting — just voting by popularity among the different values you get back. That turns out not to work, because we're only guaranteed that the reader overlaps with the writer in at least one server, so it could be that the correct value is represented by only one of the servers the reader consulted. In a system with, say, six replicas, the read quorum might be four: you might get back four answers, and only one of them is the correct answer, from the one server where you overlap with the previous write. So you can't use voting. Instead, these quorum systems need version numbers: every time you do a write, you accompany your new value with an increasing version number, and the reader, which gets back a bunch of different values from its read quorum, just uses the one with the highest version number. So the 23 on S1 and S2 — maybe S3 had an old value of 20 — each of these is tagged with a version number: maybe S1's is version 3, S2's is also version 3 because it came from the same write, and the server that didn't see the write has version 2. The reader gets back these values with their version numbers and picks the value with the highest version number. And in Aurora this is essentially — well, never mind about Aurora for a moment.

Furthermore, if you can't actually contact a quorum for a read or a write, you really just have to keep trying — those are the rules — keep trying until the servers are brought back up or reconnected. The reason this is preferable to something like chain replication is that it can easily ride out temporarily dead, disconnected, or slow servers. In practice, if you want to write, you send the newly written value, plus its version number, to all N of the servers, but you only wait for W of them to respond; and similarly, if you want to read, you send the read to all of the servers and only wait for R of them to respond. Because you only have to wait for R (or W) out of N, you can continue after the fastest R or the fastest W have responded, and you don't have to wait for a slow server or a server that's dead. The machinery for ignoring slow or dead servers is completely implicit: there's nothing here about having to make decisions about which servers are up or down, or who the leader is; it just automatically proceeds as long as a quorum is available. So we get very smooth handling of dead or slow servers.
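Here's a minimal sketch of this classic quorum read/write logic — just the abstract scheme from the board, with made-up types, not Aurora's protocol: send to every replica, wait for the first W (or R) replies, and let version numbers pick the winner on the read side.

    package main

    import "fmt"

    type versioned struct {
        val, version int
    }

    // One toy replica: it stores a single object and keeps the highest-versioned value it has seen.
    type replica struct{ stored versioned }

    func (r *replica) write(v versioned) bool {
        if v.version > r.stored.version {
            r.stored = v
        }
        return true // acknowledgment
    }

    func (r *replica) read() versioned { return r.stored }

    const N, W, R = 3, 2, 2 // R + W > N, so read and write quorums overlap

    // Write: send to all N replicas, succeed once W have acknowledged.
    func quorumWrite(reps []*replica, v versioned) bool {
        acks := 0
        for _, r := range reps { // in reality these go out in parallel and we
            if r.write(v) { //     stop waiting after the fastest W acks
                acks++
            }
        }
        return acks >= W
    }

    // Read: ask all N, take the highest-versioned value among the first R replies.
    func quorumRead(reps []*replica) versioned {
        best := versioned{}
        for _, r := range reps[:R] { // pretend these R replied first
            if v := r.read(); v.version > best.version {
                best = v
            }
        }
        return best
    }

    func main() {
        reps := []*replica{{}, {}, {}}
        quorumWrite(reps, versioned{val: 23, version: 3})
        fmt.Println(quorumRead(reps)) // {23 3}
    }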
In addition — there's not much leeway in this simple three-server case, but even here — you can adjust R and W to favor either reads or writes. We could say the write quorum is three, so every write has to go to all three servers, and in that case the read quorum can be one. So if you wanted to favor reads, you could set R to one and W to three: now reads are much faster, they only have to wait for one server, but in return writes are slow. If you wanted to favor writes, you could say a read has to come from all of them but a writer only has to write one: only one server might have the latest value, and readers have to consult all three, but they're guaranteed that their three will overlap with that one. Of course, these particular values make writes (or, in the other case, reads) not fault-tolerant, because all the servers have to be up, so you probably wouldn't want to do this in real life; you would, as Aurora does, have a larger number of servers and intermediate read and write quorums.

Aurora, in order to achieve its goals of being able to write with one dead availability zone, and read with one dead availability zone plus one other dead server, uses a quorum system with N equal to six, W equal to four, and R equal to three. W equals four means it can do a write with one dead availability zone: if that zone can't be contacted, the other four servers are enough to complete a write. With a read quorum of three, four plus three is seven, greater than six, so reads and writes are definitely guaranteed to overlap. And a read quorum of three means that even if one availability zone is dead plus one more server, the three remaining servers are enough to serve a read. In that case, with three servers down, the system can do reads and can reconstruct the current state of the database, but it can't do writes without further work: with three dead servers they have enough of a quorum to read the data and reconstruct more replicas, but until they've created more replicas to replace the dead ones, they can't serve writes. And, as I explained before, the quorum system also allows them to ride out transient slow replicas.

Now, as it happens, writes in Aurora aren't really overwriting objects as in a classic quorum system. In fact, Aurora's writes never overwrite anything: its writes just append log entries to the current log. So the way it's using quorums is basically to say: when the database sends out a new log record, because it's executing some transaction, it needs to make sure that the log record is present on at least four of its storage servers before it's allowed to proceed with the transaction commit. That's really the meaning of Aurora's write quorum: each new log record has to be appended at at least four of the six replicas before the write can be considered to have completed.
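You can sanity-check that these numbers meet the fault-tolerance goals from earlier — this is just my arithmetic on the N, W, R values, not anything extra from the paper:

    package main

    import "fmt"

    func main() {
        const N, W, R = 6, 4, 3

        fmt.Println("quorums overlap:", R+W > N)                      // true: 3 + 4 = 7 > 6
        fmt.Println("write with one AZ (2 servers) down:", N-2 >= W)  // true
        fmt.Println("read with one AZ + 1 more (3) down:", N-3 >= R)  // true
        fmt.Println("write with one AZ + 1 more down:", N-3 >= W)     // false: must re-replicate first
    }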
And when Aurora gets to the end of a transaction, before it can reply to the client — before it can tell the client, your transaction is committed and finished and durable — Aurora has to wait for acknowledgments from a write quorum for each of the log records that made up that transaction. In fact, because after a crash and recovery you're not allowed to recover one transaction if preceding transactions aren't also recovered, in practice, before Aurora can acknowledge a transaction, it has to wait for a write quorum of storage servers to respond for all previously committed transactions as well as the transaction of interest, and only then can it respond to the client.

Okay, so these storage servers are getting incoming log records — that's what writes look like to them. So what do they actually do? They're not getting new data pages from the database server; they're just getting log records that describe changes to the data pages. Internally, one of these storage servers has copies of all the data pages, as of some point in the database's evolution. So it has, maybe in its cache or on its disk, a whole bunch of these pages: page one, page two, and so forth. When a new write comes in — a new write carrying just a log record — what has to happen some day, but not right away, is that the change in that log record, the new value, has to be applied to the relevant page. But the storage server doesn't have to do that until someone asks — until the database server or the recovery software asks to see that page. So what happens immediately to a new log record is that it's just appended to a list of log records that affect each page. For every page the storage server stores, if the page has been recently modified by a log record from a transaction, what the storage server actually stores is an old version of the page plus the sequence of log records that have come in from the database server since that page was last brought up to date. If nothing else happens, the storage server just stores these old pages plus lists of log records. If the database server later evicts the page from its cache and then needs to read it again for a future transaction, it sends a read request to one of the storage servers saying: I need an updated copy of page one. At that point the storage server applies those log records to the page — does the writes of new data described in the log records — and sends the updated page back to the database server; and presumably it can then erase the list and just store the newly updated page, although it's not quite that simple. So the storage servers store these strings of log records plus old page versions.
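Here's a rough sketch of what one storage server keeps per page and what it does when the database asks for a page — an invented structure, just to capture the "old page plus pending log records, applied lazily on read" idea:

    package main

    import "fmt"

    // A log record that says "set this page's copy of the record to New".
    // (Real records describe B-tree page modifications; this is simplified.)
    type LogRec struct {
        Page int
        New  int
    }

    // Per page: an old version plus the log records received since the page
    // was last brought up to date.
    type pageState struct {
        data    int // stand-in for an 8 KB page
        pending []LogRec
    }

    type storageServer struct {
        pages map[int]*pageState
    }

    // Writes are cheap: just append the log record to the page's pending list.
    func (s *storageServer) applyLog(rec LogRec) {
        p := s.pages[rec.Page]
        p.pending = append(p.pending, rec)
    }

    // Reads are where the work happens: fold the pending records into the
    // old page and return the up-to-date version.
    func (s *storageServer) readPage(page int) int {
        p := s.pages[page]
        for _, rec := range p.pending {
            p.data = rec.New
        }
        p.pending = nil
        return p.data
    }

    func main() {
        s := &storageServer{pages: map[int]*pageState{1: {data: 500}}}
        s.applyLog(LogRec{Page: 1, New: 510})
        fmt.Println(s.readPage(1)) // 510
    }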
Now, the database server, as I mentioned, sometimes needs to read pages. One thing to observe is that the database server writes log records but reads data pages, so this is also a different kind of quorum system in the sense that the things being read and the things being written are quite different. In addition, it turns out that in ordinary operation the database server doesn't have to send quorum reads at all, because it tracks, for each of the storage servers, how much of the prefix of the log that storage server has actually received. The database server keeps track of six numbers. Log entries are numbered — one, two, three, four, five — and the database server sends new log entries to all the storage servers; a storage server that receives them responds saying, I got log entry 79, and furthermore I have every log entry before 79 as well. The database server keeps track of these numbers — how far each server has gotten, the highest contiguous log entry number each server has received — so that when the database server needs to do a read, it just picks a storage server that's up to date and sends the read request for the page it wants to just that one storage server. So the database server does have to do quorum writes, but ordinarily it doesn't have to do quorum reads: it knows which of the storage servers are up to date and just reads from one of them. So reads are cheaper than they would otherwise be — the database server reads one copy of the page and doesn't have to go through the expense of a quorum read.
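A sketch of how the database server can skip quorum reads in normal operation — made-up numbers, but the idea is just "remember each server's highest contiguous log index and read from any server that is caught up":

    package main

    import "fmt"

    func main() {
        // Highest contiguous log entry number acknowledged by each of the
        // six storage servers (imaginary values).
        acked := []int{103, 101, 103, 102, 103, 99}

        need := 103 // this read needs a server that has everything through entry 103

        for server, n := range acked {
            if n >= need {
                fmt.Println("read the page from server", server) // server 0, the first up-to-date one
                break
            }
        }
    }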
Now, Aurora does sometimes use quorum reads: it turns out that's during crash recovery of the database server — and this is different from crash recovery of the storage service. The database server runs in an EC2 instance on some real piece of hardware; maybe that hardware suffers a failure and the database server crashes. There's monitoring infrastructure at Amazon that says, wait a minute, the Aurora database server we were running for this customer just crashed, and Amazon will automatically fire up a new EC2 instance, start the database software on it, and tell it: your data is sitting on this particular volume, this set of storage servers; please clean up any partially executed transactions that are evident in the logs stored on those storage servers, and then continue. That's the point at which Aurora uses quorum logic for reads, because when the previous database server crashed it was almost certainly partway through executing some set of transactions. The state of play at the time of the crash was: it had completed some transactions and committed them, and their log entries are on a quorum; plus it was in the middle of executing some other set of transactions, which also may have log entries on a quorum, but which can never be completed, because the database server crashed midway through them.

For those incomplete transactions there may also be gaps in the log: maybe one server's log ends at entry 101, another's at 102, and a 104 exists somewhere, but for some as-yet-uncommitted transaction no server ever got a copy of log entry 103 before the crash. So after a crash, the new database server doing the recovery does quorum reads to find the point in the log — the highest log entry number for which every preceding log entry exists somewhere in the storage service. Basically it finds the number of the first missing log entry, which here is 103, and says: we're missing a log entry, so we can't do anything with the log after this point, because we're missing an update. The database server does these quorum reads, finds that 103 is the first entry that's missing — it looks at the quorum of servers it can reach, and 103 is not there — and sends a message to all the servers saying: please just discard every log entry from 103 onwards. Those discarded entries necessarily do not include log entries from committed transactions, because we know a transaction can't commit until all of its entries are on a write quorum, so we would be guaranteed to see them; we're only discarding log entries from uncommitted transactions, of course. So we're cutting off the log at entry 102. The log entries we're preserving may still include entries from uncommitted transactions — transactions that were interrupted by the crash — and the database server has to detect those, which it can do by seeing that a transaction has update entries in the log but no commit record. The database server finds the full set of those uncompleted transactions and basically issues undo operations — new log entries that undo all of the changes those uncommitted transactions made. And that's the point at which Aurora needs those old values in the log entries: so that a server doing recovery after a crash can back out of partially completed transactions.
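Here's a sketch of that recovery-time computation — find the first missing log entry among the servers you can reach (a read quorum) and truncate from there; the representation is invented, not Aurora's actual algorithm:

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Union of log entry numbers found on a read quorum of storage
        // servers after the database server crashed (made-up example).
        entries := []int{100, 101, 102, 104, 105}
        sort.Ints(entries)

        // Find the first gap. Entries at or after the gap cannot belong to
        // committed transactions, since the database only acknowledges a
        // commit once everything up through it is on a write quorum, so
        // it is safe to discard from here on.
        next := entries[0]
        for _, e := range entries {
            if e != next {
                break
            }
            next++
        }
        fmt.Println("tell every server: discard log entries from", next, "onward") // 103
    }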
Alright, another thing I'd like to talk about is how Aurora deals with big databases. So far I've explained the storage setup as if the database just has these six replicas of its storage, each of them a computer with a disk or two attached. If that were the whole story, we couldn't have a database bigger than the amount of storage you can put on a single machine; the fact that we have six machines doesn't give us six times as much usable storage, because each one is storing a replica of the same data over again. With solid-state drives we can put terabytes of storage on a single machine, but we can't put hundreds of terabytes on a single machine. So in order to support customers who need more than, say, ten terabytes, who need vast databases, Amazon will split the database's data onto multiple sets of six replicas. The unit of sharding, the unit of splitting up the data, is I think 10 gigabytes. A database that needs 20 gigabytes of data will use two protection groups, these PG things, to store its data: half of it will sit on the six servers of protection group one, and then there'll be another six servers, possibly a different set of six storage servers, because Amazon runs a huge fleet of these storage servers that are jointly used by all of its Aurora customers. The second ten gigabytes of the database's 20 gigabytes of data will be replicated on another, typically different, set of six servers; there could be overlap between the two sets, but typically it's just a different set of six. So now we get 20 gigabytes of data, and we add more of these protection groups as a database grows bigger. One interesting piece of fallout from this is that while it's clear you can take the data pages and split them up over multiple independent protection groups, maybe odd-numbered data pages from your B-tree go on PG one and even-numbered pages go on PG two, so sharding the data pages is easy, it's not immediately obvious what to do with the log. How do you split up the log if you have two or more of these protection groups? The answer Aurora uses is that when the database server sends out a log record, it looks at the data that the log record modifies, figures out which protection group stores that data, and sends each log record just to the protection groups that store data modified by that log entry. That means each protection group stores some fraction of the data pages plus all the log records that apply to those data pages; each protection group stores the subset of the log that's relevant to its pages.
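A tiny sketch of that routing decision, with invented page and partitioning details just for illustration: the 10 gigabyte segment size is from the lecture, but the 16 KB page size and the simple range partitioning are assumptions, not Aurora's actual scheme.

    PAGE_SIZE = 16 * 1024                  # assumed 16 KB data pages
    PG_SIZE = 10 * 1024 ** 3               # 10 GB per protection group segment
    PAGES_PER_PG = PG_SIZE // PAGE_SIZE

    def protection_group_for(page_id):
        # simple range partitioning of data pages onto protection groups
        return page_id // PAGES_PER_PG

    def route_log_record(record):
        # record["pages"]: the data pages this log record modifies;
        # return the set of protection groups that must receive the record
        return {protection_group_for(p) for p in record["pages"]}

    # usage: a record touching pages in two groups is sent to both groups
    print(route_log_record({"lsn": 500, "pages": [3, PAGES_PER_PG + 7]}))  # {0, 1}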
A final requirement, maybe I erased the fault-tolerance requirements from the board, but a final requirement is that if one of these storage servers crashes, we want to be able to replace it as soon as possible, because if we wait too long we risk three or four of them crashing, and if four crash then we can't recover at all, because we no longer have a quorum. So we need to regain full replication as soon as possible. Think about any one storage server: sure, this server is storing 10 gigabytes for my database's protection group, but the physical setup of any one of these servers is that it has, say, a one or two terabyte disk storing 10 gigabyte segments from a hundred or more different Aurora instances. So what's on this physical machine is a terabyte, or ten terabytes, or whatever, of data in total. When one of these storage servers crashes, it takes with it not just the 10 gigabytes from my database but also 10 gigabytes from a hundred other people's databases, and what has to be re-replicated is not just my 10 gigabytes but the entire terabyte or more stored on this server's solid-state drive. If you think through the numbers: maybe we have 10 gigabit per second network interfaces; moving 10 terabytes across a 10 gigabit per second link from one machine to another takes something like ten thousand seconds, and that's way too long. We don't want a strategy where the way we reconstruct this data is to have one other machine that was replicating everything send 10 terabytes to a single replacement machine; we want to reconstruct the data far faster than that. So the actual setup they use is this: a particular storage server stores many, many segments, replicas of many 10 gigabyte protection groups. For one segment it stores, call it protection group A, the other replicas are on five other machines; those six machines are all storing segments of protection group A. And there's a whole bunch of other segments on this server too: maybe it also stores a replica for protection group B, but the other copies of B's data sit on a disjoint set of servers, so a different set of five servers holds the other copies of B, and so on for all the segments sitting on this storage server's drive, belonging to many different Aurora instances. So when this machine goes down, the replacement strategy is: say it was storing a hundred segments, we pick a hundred different storage servers, each of which picks up one new segment, that is, each of which now participates in one more protection group. We pick one server to re-replicate onto for each of these 10 gigabyte segments; now we have maybe a hundred different storage servers, probably storing other stuff but with a little free disk space. Then for each segment we pick one of the remaining replicas to copy the data from: maybe for A we copy from this one, for B from that one; if we have five other copies of C we pick yet another server as the source for C. So we copy A from this server to that server, B like this, C like this, and now we have a hundred different 10 gigabyte copies going on in parallel across the network. Assuming we have enough servers that these pairs can all be disjoint, and plenty of bandwidth in the switching network that connects them, we can copy our terabyte or 10 terabytes of data with a hundredfold parallelism, and the whole thing takes something like 10 seconds instead of the thousands of seconds it would take with just two machines involved. Anyway, this is the strategy they use, and it means that when one machine dies they can recover from that machine's death, in parallel, extremely quickly. If lots of machines die it doesn't work as well, but they can re-replicate after single-machine crashes extremely quickly.
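The arithmetic behind that, plus a sketch of the scheduling idea, assuming 10 Gbit/s links and made-up data structures for which servers hold which segments:

    def seconds_to_copy(bytes_total, gbit_per_sec):
        return bytes_total * 8 / (gbit_per_sec * 1e9)

    one_link = seconds_to_copy(10 * 1e12, 10)     # ~8000 s to push 10 TB over one link
    per_segment = seconds_to_copy(10 * 1e9, 10)   # ~8 s for one 10 GB segment
    print(one_link, per_segment)

    def plan_rereplication(lost_segments, surviving_replicas, spare_servers):
        # for each 10 GB segment on the dead server, pick one surviving replica
        # as the source and a distinct spare server as the destination, so all
        # the copies can run in parallel (illustrative scheduling, not Aurora's)
        plan = []
        for seg, spare in zip(lost_segments, spare_servers):
            plan.append((seg, surviving_replicas[seg][0], spare))
        return plan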
Alright, a final thing the paper mentions: if you look at figure 3 you'll see that not only do they have this main database, they also have replica databases. Many of their customers see far more read-only queries than read/write queries. Think about a web server: if you just view a page on some website, chances are the web server you connected to has to read lots of stuff to generate everything shown on the page, maybe hundreds of different items read out of some database, but the number of writes for a typical page view is usually much smaller, maybe some statistics get updated or a little bit of history recorded for you. So you might have a hundred-to-one ratio of reads to writes, that is, a very large number of pure read-only database queries. Now, with this setup the writes can only go through the one database server, because we really can only support one writer with this storage strategy. One place where the rubber really hits the road there is that the log entries have to be numbered sequentially, which is easy if all the writes go through a single server and extremely difficult if lots of different servers are all writing in an uncoordinated way to the same database. So the writes really have to go through one database server, but we can set up, and indeed Amazon does set up, read-only database replicas that can read from these storage servers. The full glory of figure 3 is that in addition to the main database server that handles the write requests, there's also a set of read-only databases, and they say they can support up to 15 of them. So if you have a read-heavy workload, most of it can be hived off to a whole bunch of these read-only databases. When a client sends a read request to a read-only database, that database figures out which data pages it needs to serve the request and sends reads directly into the storage system, without bothering the main read/write database. The read-only replica databases send page read requests directly to the storage servers, and then they cache those pages so they can respond to future read requests right out of their cache.
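A minimal sketch of that read path: a read-only replica's page cache that fetches a page from one storage server on a miss. fetch_page here is a stand-in for whatever RPC the storage servers actually expose, not a real API.

    class ReadReplicaCache:
        def __init__(self, fetch_page):
            self.fetch_page = fetch_page   # fetch_page(page_id) -> page contents
            self.cache = {}

        def read(self, page_id):
            if page_id not in self.cache:
                # miss: read one copy straight from the storage service,
                # bypassing the single read/write database server
                self.cache[page_id] = self.fetch_page(page_id)
            return self.cache[page_id]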
Of course, they need to be able to update those caches, and for that reason the main database also sends a copy of its log to each of the read-only databases; that's what the horizontal lines between the blue boxes in figure 3 are. The main database sends all the log entries to these read-only databases, which use them to update their cached copies of the pages to reflect recent transactions. It does mean the read-only databases lag a little bit behind the main database, but for a lot of read-only workloads that's okay: if you look at a web page and it's 20 milliseconds out of date, that's usually not a big problem. There are some complexities from this, though. One problem is that we don't want these read-only databases to see data from uncommitted transactions, so in this stream of log entries the main database needs to denote which transactions have committed, and the read-only databases are careful not to apply uncommitted transactions to their caches; they wait until the transactions commit. The other complexity these read-only replicas impose is that the data structures involved, a B-tree for example, are quite complex. A B-tree might need to be rebalanced periodically, and rebalancing is a complex operation in which a lot of the tree has to be modified; the tree is incorrect while it's being rebalanced, and you're only allowed to look at it after the rebalancing is done. If these read-only replicas directly read the pages out of the storage servers, there's a risk they might see the B-tree stored in those data pages in the middle of a rebalancing or some other multi-page operation, when the data is simply not legal, and they might crash or malfunction. When the paper talks about mini-transactions and the VDL versus VCL distinction, it's describing the machinery by which the database server can tell the storage servers: this complex sequence of log entries must only be revealed atomically, all or nothing, to any read-only transactions. That's what the mini-transactions and VDL are about; basically, when a read-only database asks a storage server for a data page, the storage server is careful to show it data either from just before one of these mini-transaction sequences of log entries or from just after, but not from the middle.
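Here is one way to picture what a replica has to do with that incoming log stream: buffer each transaction's updates and apply them to cached pages only at commit, so readers never observe a half-applied multi-page operation. This is my own illustration of the constraint described above; the real VDL/mini-transaction machinery lives partly in the storage servers and is more involved.

    class ReplicaLogApplier:
        def __init__(self, cache):
            self.cache = cache             # page_id -> cached page contents
            self.pending = {}              # txn id -> buffered update records

        def on_log_record(self, rec):
            if rec["type"] == "update":
                self.pending.setdefault(rec["txn"], []).append(rec)
            elif rec["type"] == "commit":
                # apply the whole transaction's updates together, in order, so the
                # cache never exposes a half-finished B-tree operation
                for upd in self.pending.pop(rec["txn"], []):
                    self.cache[upd["page"]] = upd["new_value"]
            elif rec["type"] == "abort":
                self.pending.pop(rec["txn"], None)   # never applied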
Alright, that's all the technical material I have. Just to summarize what's interesting about the paper and what can be learned from it: one thing to learn, which is just good in general and not specific to this paper, is that everybody in systems should know the basics of how transaction-processing databases work, and the impact of the interaction between transaction-processing databases and storage systems, because it comes up a lot; the performance and crash-recoverability complexity of running a real database comes up over and over again in systems design. Another thing to learn from this paper is the idea of quorums and overlap, the technique of overlapping read and write quorums so that you can always see the latest data while also getting fault tolerance; this comes up in Raft too, which has a strong quorum flavor to it. Another interesting thought from this paper is that the database and the storage system are basically co-designed; there's integration across the database layer and the storage layer. Ordinarily we try to design systems so they have good separation between consumers of services and the infrastructure services; typically storage is very general-purpose, not aimed at a particular application, both because that's a pleasant design and because lots of different uses can be made of the same infrastructure. But here the performance issues were so extreme, they were able to get something like a 35 times performance improvement by blurring that boundary, that this was a situation in which general-purpose storage was really not advantageous, and they got a big win by abandoning it. A final set of things to get out of the paper is all the interesting, sometimes implicit, information about what mattered to these Amazon engineers, who really know what they're doing, and what concerns they had about cloud infrastructure. The amount of worry they put into the possibility that an entire availability zone might fail is an important tidbit. The fact that transient slowness of individual storage servers was important is another thing that comes up a lot in practice. And finally there's the implication that the network is the main bottleneck: after all, they went to extreme lengths to send less data over the network, even though in return the storage servers have to do more work; they're willing to keep six copies of the data and have six CPUs all replicating the execution of applying these redo log entries. Apparently CPU was relatively cheap for them, whereas network capacity was extremely precious. Alright, that's all I have to say, and see you next week.