I'd like to talk today about FaRM and optimistic concurrency control, which is the main interesting technique it uses. The reason we're talking about FaRM is that it's the last paper in our series about transactions, replication, and sharding, and this is still an open research area: people are not at all satisfied with the performance, or with the performance-versus-consistency trade-offs that are available, and they're still trying to do better. This particular paper is motivated by the huge performance potential of new RDMA NICs.

You may be wondering, since we just read about Spanner, how FaRM differs from Spanner. Both of them replicate, and both use two-phase commit for transactions, so at that level they seem pretty similar. Spanner is a deployed system that's been used a lot for a long time. Its main focus is geographic replication, that is, being able to have copies on, say, the east and west coasts in different data centers, and to run reasonably efficient transactions that involve pieces of data in lots of different places. Its most innovative feature, intended to reduce the cost of two-phase commit over long distances, is a special optimized path for read-only transactions using synchronized time. The performance you get out of Spanner, if you remember, is that a read/write transaction takes 10 to 100 milliseconds, depending on how far apart the data centers are.

FaRM makes a very different set of design decisions and targets a different kind of workload. First of all, it's a research prototype, not by any means a finished product, and its goal is to explore the potential of this new high-speed RDMA networking hardware, so it's really still an exploratory system. It assumes that all the replicas are in the same data center; the design wouldn't make sense if the replicas were in different data centers, let alone on the East Coast versus the West Coast. So it's not trying to solve the problem Spanner is about, namely: if an entire data center goes down, can I still get at my data? The extent of its fault tolerance is individual crashes, plus the ability to recover after a whole data center loses power and is then restored. It uses this RDMA technique, which I'll talk about, and RDMA turns out to seriously restrict the design options; because of it, FaRM is forced to use optimistic concurrency control. On the other hand, the performance they get is far, far higher than Spanner's: FaRM can do a simple transaction in 58 microseconds (this is from Figure 7 and Section 6.3). That 58 microseconds, versus the roughly 10 milliseconds that Spanner takes, is about a hundred times faster. So that's maybe the main difference: FaRM gets much higher performance, but it's not aimed at geographic replication. FaRM's performance is extremely impressive, much faster than just about anything else.
Another way to look at it is that Spanner and FaRM target different bottlenecks. In Spanner, the main bottleneck the designers worried about is speed-of-light and network delays between data centers, whereas in FaRM the main bottleneck the design worries about is CPU time on the servers, because they've essentially wished away the speed-of-light and network delays by putting all the replicas in the same data center.

All right, so as background for how this fits into the 6.824 sequence: the setup in FaRM is that everything runs in one data center. There's a configuration manager, which we've seen before, and it's in charge of deciding which servers should be the primary and the backup for each shard of data. If you read carefully you'll see that they use ZooKeeper to help implement this configuration manager, but that's not the focus of the paper at all. Instead, the interesting thing is that the data is sharded, split up by key across a bunch of primary/backup pairs: one shard goes on primary 1 and backup 1, another shard on primary 2 and backup 2, and so forth. That means any time you update data, you need to update it both on the primary and on the backup. These replicas are not maintained by Paxos or anything like it; instead, all the replicas of the data are updated whenever there's a change, and reads always go to the primary. The reason for this replication, of course, is fault tolerance, and the kind of fault tolerance they get is that as long as one replica of a given shard is available, that shard will be available. They only require one living replica, not a majority, and the system as a whole, if there's, say, a data-center-wide power failure, can recover as long as at least one replica of every shard survives. Another way of putting that is that with F+1 replicas they can tolerate up to F failures for that shard.

In addition to the primary/backup copies of each shard of data, there's transaction code that runs. It's maybe most convenient to think of the transaction code as running in separate clients. In fact, in their experiments they run the transaction code on the same machines as the actual FaRM storage servers, but I'll mostly think of it as a separate set of clients. The clients run transactions, and the transactions need to read and write data objects stored in the sharded servers. Each client not only runs transactions but also acts as the transaction coordinator for two-phase commit.
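To make the sharding picture concrete, here is a minimal sketch of the kind of mapping the configuration manager hands out: each region (shard) number maps to one primary and its backups, reads go to the primary, and writes must update every replica. The Go types and server names here are hypothetical, not FaRM's actual data structures.

```go
package main

import "fmt"

// ReplicaSet records which servers hold one region of data: one
// primary plus its backups. With f+1 replicas, the region stays
// available as long as at most f of them have failed.
// (Hypothetical sketch; not FaRM's actual configuration-manager state.)
type ReplicaSet struct {
	Primary string
	Backups []string
}

// config is the kind of table the configuration manager (backed by
// ZooKeeper) hands out: region number -> replica set.
var config = map[uint32]ReplicaSet{
	1: {Primary: "server1", Backups: []string{"server2"}},
	2: {Primary: "server3", Backups: []string{"server4"}},
}

func main() {
	rs := config[1]
	fmt.Println("reads of region 1 go to the primary:", rs.Primary)
	fmt.Println("writes to region 1 must update:", rs.Primary, rs.Backups)
}
```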
OK, so that's the basic setup. Now, how do they get performance? Because this really is a paper about how you can get high performance and still have transactions, and they get it from a handful of ingredients. The main one is sharding: in their experiments they shard their data 90 ways across 90 servers (or maybe it's 45 ways). As long as operations on different shards are more or less independent of each other, that alone gets you roughly a 90x speedup, because you can run whatever you're running in parallel on 90 servers. So that's a huge win from sharding.

Another trick they play to get good performance is that all the data has to fit in the RAM of the servers; they don't really store the data on disk. It all has to fit in RAM, which of course means you can get at it quickly. At the same time they need to tolerate power failures, which means they can't just use plain RAM, because they need to recover the data after a power failure and RAM loses its contents when the power goes out. So they have a clever non-volatile RAM scheme for making the contents of RAM survive power failures; this is in contrast to storing the data persistently on disk, and it's much faster than disk. Another trick they play is the RDMA technique itself, which relies on clever network interface cards that accept packets instructing the NIC to directly read and write the memory of the server without interrupting the server. A related trick is what's often called kernel bypass, which means that application-level code can directly access the network interface card without getting the kernel involved. So those are the clever tricks they use to get high performance; we've already talked about sharding a lot, and I'll talk about the rest in this lecture.

OK, so first I'll talk about non-volatile RAM. This is really a topic that doesn't affect the rest of the design directly. As I said, all the data in FaRM is stored in RAM. When a client transaction updates a piece of data, what that really means is that it reaches out to the relevant servers that store the data and causes those servers to modify the object in question right in RAM, and that's as far as the writes get; they don't go to disk. This is in contrast to your Raft implementations, for example, which spend a lot of time persisting data to disk; there's no such persisting in FaRM. This is a big win: a write to RAM takes about 200 nanoseconds, whereas a write even to a solid-state drive, which is pretty fast, takes about 100 microseconds, and a write to a hard drive takes about 10 milliseconds. So being able to write to RAM is worth many orders of magnitude in speed for transactions that modify things. But of course RAM loses its contents on a power failure, so it's not persistent by itself.

As an aside, you might think that writing modifications to the RAM of multiple servers, that is, having replica servers and updating all the replicas, might be persistent enough.
After all, if you have F+1 replicas, you can tolerate up to F failures. The reason simply writing to RAM on multiple servers is not good enough is that a site-wide power failure takes down all of your servers at once, violating the assumption that failures of different servers are independent. So we need a scheme that works even if power fails to the entire data center.

What FaRM does is put a big battery in every rack and run the power supply through the batteries, so the batteries automatically take over if there's a power failure and keep all the machines running, at least until the batteries run out. Of course the battery isn't very big; it may only be able to run the machines for, say, ten minutes, so the battery by itself is not enough to let the system withstand a lengthy power outage. Instead, when the battery system sees that main power has failed, it keeps the servers running but also alerts all of the servers, with some kind of interrupt or message, telling them: the power has just failed, and you only have about ten minutes before the batteries give out too. At that point the software on FaRM's servers stops all FaRM processing, and then each server copies all of its RAM to a solid-state drive attached to that server, which might take a couple of minutes. Once all the RAM has been copied to the SSD, the machine shuts itself down. So if all goes well, after a site-wide power failure all the machines have saved their RAM to disk, and when power comes back to the data center, each machine reads the memory image that was saved on its disk as it reboots and restores it into RAM. There's some recovery that has to go on, but basically they won't have lost any of their persistent state due to the power failure. What that really means is that FaRM uses conventional RAM but has essentially made it non-volatile, able to survive power failures, with this trick of a battery, the battery alerting the server, and the server saving its RAM contents to a solid-state drive. Any questions about the NVRAM scheme?

This is a useful trick, but it's worth keeping in mind that it only helps with power failures. The whole sequence of events is only set in motion when the battery notices that main power has failed. If something else causes a server to fail, like a hardware fault or a bug in the software that causes a crash, the non-volatile RAM scheme does nothing for those crashes: the machine will reboot and lose the contents of its RAM, and it won't be able to recover them. So the NVRAM scheme is good for power failures but not for other crashes, and that's why, in addition to NVRAM, FaRM also keeps multiple replicas of each shard.
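Here is a rough sketch of the power-failure path just described, assuming a hypothetical alert from the battery system and a local file standing in for the SSD image; FaRM's real code isn't published in this form.

```go
package main

import (
	"fmt"
	"os"
)

// onPowerFailAlert is the hypothetical handler run when the rack
// battery reports that main power has failed: stop FaRM processing,
// dump the in-RAM regions to the local SSD, then power off.
func onPowerFailAlert(regions []byte, imagePath string) {
	stopFarmProcessing() // stop serving transactions and log entries
	if err := os.WriteFile(imagePath, regions, 0600); err != nil {
		fmt.Println("save failed:", err) // a real system would retry or raise an alarm
		return
	}
	shutDown()
}

// onReboot reloads the saved image, if any, when power returns.
func onReboot(imagePath string) []byte {
	img, err := os.ReadFile(imagePath)
	if err != nil {
		return nil // no saved image: this was a crash, not a power failure
	}
	return img
}

func stopFarmProcessing() { fmt.Println("stopping FaRM processing") }
func shutDown()           { fmt.Println("powering off") }

func main() {
	onPowerFailAlert([]byte("in-memory regions ..."), "farm-image.bin")
	fmt.Printf("restored %d bytes after reboot\n", len(onReboot("farm-image.bin")))
}
```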
All right. So this NVRAM scheme essentially eliminates persistence writes as a bottleneck in the performance of the system, leaving the network and the CPU as the remaining bottlenecks, which is what we'll talk about next.

There's a question: if the data center's power fails and FaRM saves everything to the solid-state drives, would it be possible to carry all the drives to a different data center and continue operation there? In principle, absolutely. In practice, I think it would almost certainly be easier to restore power to the data center than to move the drives; the problem is there's no power in the old data center, so you'd have to physically move the drives, and maybe the computers, to the new data center. So if you wanted to do this it might be possible, but it's certainly not what the FaRM designers had in mind; they assumed power would be restored.

OK, so that's NVRAM, and at this point we can mostly ignore it for the rest of the design. It doesn't really interact with the rest of the design, except that we know we don't have to worry about writing data to disk.

All right. As I mentioned, once you eliminate writing data to disk for persistence, the remaining bottlenecks have to do with the CPU and the network. In fact, in FaRM, and indeed in a lot of the systems I've been involved with, a huge bottleneck has been the CPU time required to deal with network interactions, so the network and the CPU are kind of joint bottlenecks here. FaRM doesn't have any speed-of-light network problems; instead it spends a lot of effort eliminating bottlenecks in getting network data into and out of the computers.

First, as background, I want to lay out the conventional architecture for getting things like remote procedure call packets between applications on different computers, just so we have an idea of why the approach FaRM takes is more efficient. Typically, on a computer that wants to send, say, an RPC message, you have an application running in user space, with a user/kernel boundary below it. The application makes system calls into the kernel, which are not particularly cheap, in order to send data, and then there's a whole stack of software inside the kernel involved in sending data over the network. There's usually what's called a socket layer, which does buffering, and that involves copying the data, which takes time. There's typically a complex TCP protocol stack that knows all about things like retransmission, sequence numbers, checksums, and flow control; there's quite a bit of processing there. At the bottom there's a piece of hardware called the network interface card, which has a bunch of registers the kernel can talk to in order to configure it, plus the hardware required to send bits out over the cable onto the network.
There's a network interface card driver in the kernel, and any self-respecting network interface card uses direct memory access (DMA) to move packets into and out of host memory, so there are queues of incoming packets that the NIC has DMA'd into memory waiting for the kernel to read, and outgoing queues of packets that the kernel would like the NIC to send as soon as convenient. So to send a message like an RPC request, it goes down from the application through this stack, the network interface card sends the bits out on a cable, and then there's the reverse stack on the other side: the network interface hardware interrupts the kernel, the kernel runs driver code, which hands packets to the TCP protocol, which writes them into buffers, and at some point the application gets around to reading them, making system calls into the kernel that copy the data out of those buffers into user space. This is a lot of software, a lot of processing, and a lot of fairly expensive CPU operations like system calls, interrupts, and copying data. As a result, classical network communication is relatively slow: it's quite hard to build an RPC system with this traditional architecture that can deliver more than, say, a few hundred thousand RPC messages per second. That might seem like a lot, but it's orders of magnitude too little for the kind of performance FaRM is targeting, and in general a couple hundred thousand RPCs per second is far, far less than what the actual network hardware, the wire and the network interface card, is capable of. These cables typically run at something like 10 gigabits per second, and it's very, very hard to write RPC software in this style that can generate or absorb anything like 10 gigabits per second of the small messages databases often need, which would be millions, maybe tens of millions, of messages per second.

OK, so that's the plan FaRM doesn't use, and in a sense FaRM's design is a reaction to it. Instead, FaRM uses two ideas to reduce the cost of pushing packets around. The first one I'll call kernel bypass. The idea here is that, instead of the application sending all its data down through a complex stack of kernel code, the kernel configures the protection machinery in the computer to give the application direct access to the network interface card. The application can actually reach out and touch the NIC's registers and tell it what to do. In addition, with this kernel-bypass scheme, the network interface card DMAs directly into application memory, where the application can see the bytes arriving without kernel intervention, and when the application needs to send data, it can create queues that the network interface card directly reads with DMA and sends out over the wire.
So now we've completely eliminated all the kernel code involved in networking. The kernel just isn't involved: there are no system calls and no interrupts. The application directly reads and writes the memory that the network interface card sees, and of course the same thing happens on the other side. This is an idea that was not possible years ago, but most modern serious network interface cards can be set up to do this. It does, however, require the application to take over all the things TCP was doing for it, like checksums and retransmission; the application is now in charge of those. You can actually do kernel bypass yourself, using a toolkit you can find on the web called DPDK; it's relatively easy to use and lets people write extremely high-performance networking applications. So FaRM does use this: its applications talk directly to the NIC, and the NIC DMAs things right into application memory.

We have a student question: does this mean that FaRM machines run a modified operating system? I don't know the actual answer; I believe FaRM runs on some form of Windows, and whether or not they had to modify Windows I don't know. In the Linux world there's already full support for this. It does require kernel cooperation, because ordinarily application code cannot touch devices directly, so Linux had to be modified to allow the kernel to delegate hardware access to applications. Those modifications are already in Linux, and maybe already in Windows too. In addition, this relies on fairly intelligent NICs, because of course multiple applications will want to play this game with the network interface card, and modern NICs actually know about multiple distinct queues, so each application can have its own set of queues that the NIC knows about. So it has required modification of a lot of things.

OK, so step one is this kernel bypass idea. Step two is an even cleverer kind of NIC, and now we're getting into hardware that is not in wide use at the moment; you can buy it commercially, but it's not the default. This is the RDMA scheme, remote direct memory access. These are special network interface cards that support RDMA, and both sides have to have them. I'm drawing them as connected by a cable, but in fact there's always a switch that has connections to many different servers and allows any server to talk to any other server. So we have these RDMA NICs, and again we have the applications and their memory.
Now, though, an application on the source host can send a special message through the RDMA system that tells the destination host's network interface card to directly read or write some bytes of memory, probably a cache line, in the target application's address space. Hardware and firmware on the network interface controller perform the read or write of the target application's memory directly, and then send the result back to an incoming queue on the source application. The cool thing about this is that the destination computer's CPU, and the application on it, knew nothing about the read or write; it's executed entirely in firmware on the network interface card. There are no interrupts; the application didn't have to think about the request or about replying. The NIC just reads or writes the memory and sends the result back to the source. If all you need to do is read or write data in the RAM of the target application, this is a much, much lower-overhead way of getting at it than sending an RPC, even with fancy kernel-bypass networking.

There's a question: does RDMA always require kernel bypass to work at all? I don't know the answer. I've only ever heard it used in conjunction with kernel bypass, because the people interested in any of this are interested in it only for tremendous performance, and I'm guessing you'd throw away a lot of the performance win if you had to send the requests through the kernel.

Another question notes that TCP supports in-order delivery, duplicate detection, and a lot of other excellent properties that you actually need, so it would be extremely awkward if this setup sacrificed reliable or in-order delivery. The answer is that these RDMA NICs run their own reliable, sequenced protocol, something like TCP although it isn't TCP, between the NICs. When you ask your RDMA NIC to do a read or write, it will keep retransmitting if the request is lost until it gets a response, and it actually tells the originating software whether the request succeeded or not, so you do eventually get an acknowledgment back. So you don't in fact have to sacrifice most of TCP's good properties. Now, this stuff only works over a local network; I don't believe RDMA would be satisfactory between distant data centers. It's all tuned for very low speed-of-light delays.

OK, a particular piece of jargon the paper uses is "one-sided RDMA", and that's basically what I've just described: when an application uses RDMA to read or write the memory of another machine, that's one-sided RDMA. In fact, FaRM also uses RDMA to send messages in an RPC-like protocol.
So sometimes FaRM directly reads with one-sided RDMA, but sometimes what FaRM is using RDMA for is to append a message to an incoming message queue inside the target. In fact, for writes, what FaRM always does is use an RDMA write to append a new message to an incoming queue, or log, in the target, which the target polls. Since there are no interrupts here, the way the destination of such a message knows it has arrived is that it periodically checks these queues in memory to see whether it has received a recent message from anyone. So one-sided RDMA proper is just reads and writes, but FaRM also uses RDMA writes to send messages, appending either to a message queue or to a log on another server. And the memory being written into is all non-volatile, so all of it, including the message queues, gets written to disk if there's a power failure.

As for performance, Figure 2 shows that you can get 10 million small RDMA reads and writes per second, which is fantastic, far, far faster than you can send messages such as RPCs using TCP, and the latency of a simple RDMA read or write is about 5 microseconds. So again, this is very, very fast: 5 microseconds is slower than accessing your own local memory, but it's faster than pretty much anything else people do over networks.
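Since there are no interrupts, the receiving server just polls its per-sender queues. Here is a minimal sketch of that idea, modeling each incoming log as a slice that the sender's RDMA writes have already appended to; the real FaRM log layout and polling code are of course more involved.

```go
package main

import "fmt"

// incomingLog is one per-sender queue in this server's memory. In FaRM
// the sender appends entries with one-sided RDMA writes; here that is
// modeled as a slice plus an index of the next unprocessed entry.
type incomingLog struct {
	entries []string // entries the remote side has written
	next    int      // next entry this server has not yet processed
}

// pollLogs is the server's polling loop: no interrupts, it simply scans
// every per-sender log for unprocessed entries and handles them.
func pollLogs(logs []*incomingLog, handle func(sender int, msg string)) {
	for sender, l := range logs {
		for l.next < len(l.entries) {
			handle(sender, l.entries[l.next])
			l.next++
		}
	}
}

func main() {
	logs := []*incomingLog{{}, {}}
	// Pretend sender 1's RDMA write appended a LOCK record.
	logs[1].entries = append(logs[1].entries, "LOCK oid=42 version=7 value=...")
	pollLogs(logs, func(sender int, msg string) {
		fmt.Printf("from sender %d: %s\n", sender, msg)
	})
}
```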
So this is the promise: there's this fabulous RDMA technology, and the FaRM people wanted to exploit it. The coolest possible thing you could imagine doing with it is to use one-sided RDMA reads and writes to directly perform all the reads and writes of the records stored in the database servers' memory. Wouldn't it be fantastic if we never had to talk to the database server's CPU or its software at all, and could just get at the data we need in five microseconds a pop with direct one-sided RDMA reads and writes? In a sense, this paper is about what you have to do, starting from that idea, to actually build something useful. An interesting question, by the way, is whether you could in fact implement transactions using only one-sided RDMA: that is, read and write the data in the servers using nothing but RDMA, and never send messages that have to be interpreted by the server software. It's worth thinking about. In a sense FaRM answers that question with a no, because that's not really how FaRM works, but it is absolutely worth thinking about why pure one-sided RDMA couldn't be made to work.

All right, so the challenge is combining RDMA with transactions, sharding, and replication, because you need all of those to have a seriously useful database system. It turns out that all the protocols we've seen so far for transactions and replication require active participation by the server software: the server is actively involved in helping the clients read or write the data. For example, in the two-phase commit schemes we've seen, the server has to do things like decide whether a record is locked and, if it's not locked, set the lock on it; it's not clear how you could do that with one-sided RDMA. In Spanner, with all those versions, it was the server that figured out how to find the right version. Similarly, with transactions and two-phase commit, the data on the server isn't just data: there's committed data, and there's data that has been written but hasn't committed yet, and traditionally it's the server that sorts out whether recently updated data has committed, in order to prevent clients from seeing data that's locked or not yet known to be committed. What that means is that, without some clever thought, pure one-sided RDMA doesn't seem to be immediately compatible with transactions and replication. And indeed, while FaRM does use one-sided reads to get directly at data in the database, it is not able to use one-sided writes to modify the data.

OK, so this leads us to optimistic concurrency control. It turns out that the main trick, in a sense, that FaRM uses to allow it both to use RDMA and to get transactions is optimistic concurrency control. If you remember, I mentioned earlier that concurrency control schemes are divided into two broad categories, pessimistic and optimistic. Pessimistic schemes use locks: if a transaction is going to read or write some data, then before it can touch the data at all it must acquire a lock, and it must wait for the lock. You read about two-phase locking, for example, in the 6.033 reading: before you use data you have to lock it, you hold the lock for the entire duration of the transaction, and only when the transaction commits or aborts do you release the lock. If there are conflicts, because two transactions want to write the same data at the same time, or one wants to read and another wants to write, they can't proceed at the same time: all but one of the transactions that want to use the data must block and wait for the lock to be released. The fact that the data has to be locked, and that somebody has to keep track of who owns the lock and when it's released and so on, is what makes it unclear how you could do writes, or even reads, with one-sided RDMA in a locking scheme, because somebody has to enforce the locks. I'm being a little tentative about this because I suspect that with cleverer RDMA NICs that supported a wider range of operations, like atomic test-and-set, you might someday be able to do a locking scheme with pure one-sided RDMA, but FaRM doesn't do it.
OK, so what FaRM actually uses is an optimistic scheme. In an optimistic scheme you can read without locking: you just read the data. You don't know yet whether you were allowed to read it, or whether somebody else is in the middle of modifying it; you just read it, and the transaction uses whatever it happens to get. You also don't directly write the data in optimistic schemes. Instead, you buffer the writes locally, in the client, until the transaction finishes. Then, when the transaction finishes and you want to try to commit it, there's what's called a validation stage, in which the transaction-processing system tries to figure out whether the reads and writes you actually did were consistent with serializability; that is, it tries to figure out whether somebody was writing the data while you were reading it, because if they were, you can't commit this transaction: it computed with garbage instead of consistent read values. If the validation succeeds, you commit; if it doesn't, if you detect that somebody else was messing with the data while you were using it, you abort. So when there are conflicts, when you're reading or writing data that some other transaction is modifying at the same time, optimistic schemes abort, because the computation is already incorrect by the commit point: you already read data you weren't supposed to read. There's no way to, say, block until things are okay; the transaction is already poisoned and just has to abort, and possibly be retried.

So FaRM uses optimistic concurrency control because it wants to be able to use one-sided RDMA to just read whatever's there, very quickly. This design was really forced by the use of RDMA. It's often abbreviated OCC, for optimistic concurrency control. The interesting part of any OCC protocol is how validation works: how do you actually detect that somebody else was writing the data while you were trying to use it? That's mainly what I'll talk about in the rest of this lecture. Just to tie this back to the top level of the design: what OCC does for FaRM is that reads can use one-sided RDMA, and therefore be extremely fast, because we're going to check later whether the reads were okay.

All right. FaRM is a research prototype; it doesn't support things like SQL. It supports a fairly simple API for transactions, and here it is, just to give you a taste of what transaction code might actually look like. If you have a transaction, you have to declare the start of it, because we need to say that this particular set of reads and writes must occur as a single transaction. The code declares a new transaction by calling TxCreate. This is all laid out, by the way, in a slightly earlier paper, I think from 2014, by the same authors.
You create a new transaction, and then you explicitly call functions to read objects, supplying an object identifier, an OID, indicating which object you want to read. You get back an object, and you can modify it in local memory: you have a copy of it that you read back from the server with TxRead, so you might, say, increment some field in it. Then, when you want to update an object, you call TxWrite, again giving it the object ID and the new object contents. Finally, when you're through with all of this, you have to tell the system to commit the transaction, to actually do the validation and, if it succeeds, make the writes really take effect and become visible; for that you call the commit routine. The commit routine runs the whole pile of machinery in Figure 4, which we'll talk about, and it returns an OK value that tells the application whether the commit succeeded or was aborted; this return value has to correctly indicate whether the transaction succeeded.

There are some questions. One is: since OCC aborts when there's contention, do retries involve exponential backoff? Because otherwise, if you retried instantly and there were a lot of transactions all trying to update the same value at the same time, they'd all abort, all retry, and waste a lot of time. I don't know the answer; I don't remember the paper mentioning exponential backoff, but it would make a huge amount of sense to delay between retries and to increase the delay, to give somebody a chance of succeeding. This is much like the randomization of Raft's election timers. Another question: is the FaRM API closer in spirit to a NoSQL database? Yes, that's one way of viewing it. It doesn't have any of the fancy query machinery, like joins, that SQL has; it's a very low-level read/write interface plus transaction support, so you can view it as a NoSQL database with transactions.

All right, so that's what a transaction looks like, and these are all library calls: create, read, write, commit. Commit is a sort of complex call that actually runs the transaction-coordinator code, a variant of two-phase commit described in Figure 4. Just to repeat: while the read call goes off and actually reads from the relevant server, the write call just locally buffers the new, modified object, and it's only in commit that the objects are sent to the servers.
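To make the API concrete, here is roughly what the increment example we'll use later looks like against this interface. The Go types and method names are my own invention (the paper's interface is C++ and named a bit differently), so treat this as a sketch of the shape of a FaRM transaction rather than the real bindings.

```go
package main

import "fmt"

// Hypothetical Go-flavored rendering of the transaction API just
// described: create a transaction, read objects by OID, buffer writes
// locally, and commit.
type OID uint64

type Tx struct {
	writes map[OID][]byte // buffered writes; nothing reaches a server until Commit
}

func TxCreate() *Tx {
	return &Tx{writes: map[OID][]byte{}}
}

// Read would issue a one-sided RDMA read to the object's primary and
// remember the version number for later validation; here it's a stub.
func (t *Tx) Read(oid OID) []byte {
	return []byte{0}
}

// Write only updates the buffered local copy; the new value is shipped
// to the primaries during Commit's LOCK phase.
func (t *Tx) Write(oid OID, val []byte) {
	t.writes[oid] = val
}

// Commit would run the Figure 4 protocol (LOCK, VALIDATE,
// COMMIT-BACKUP, COMMIT-PRIMARY, truncation) and report success or abort.
func (t *Tx) Commit() bool {
	return true
}

func main() {
	tx := TxCreate()
	x := tx.Read(OID(42)) // fetch the object
	x[0]++                // modify the local copy only
	tx.Write(OID(42), x)  // buffer the update
	fmt.Println("committed:", tx.Commit())
}
```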
These object IDs are actually compound identifiers with two parts. One part identifies a region: all the memory of all the servers is split up into regions, and the configuration manager tracks which servers replicate which region number, so there's a region number in the OID, and the client can look up in a table the current primary and backups for a given region number. Then there's an address, essentially a straight memory address within that region. So the client uses the region number to pick the primary and backup to talk to, and then it hands the address to the RDMA NIC and says: please read at this address to fetch this object.

All right, another piece of detail we have to get out of the way is the server memory layout. In any one server there's a bunch of stuff in memory. First, if the server is replicating one or more regions, it has the actual regions in memory, and what a region contains is a whole lot of objects. Each object has a header, which contains a version number; these are versioned objects, but each object only has one version at a time. The high bit of the version-number word is a lock flag: so in the header of an object there's a lock flag in the high bit, a version number in the low bits, and then the actual data of the object. Every object in the server's memory has this same layout: a lock bit in the high bit, the current version number in the remaining bits, and the data. Every time the system modifies an object, it increments the version number; we'll see how the lock bits are used in a couple of minutes.
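A minimal sketch of that header encoding, with the lock flag in the high bit and the version number in the remaining bits of a single 64-bit word (the real FaRM header has more in it than this):

```go
package main

import "fmt"

// The lock flag lives in the high bit and the version number in the
// low bits, so one 64-bit word (and one atomic instruction) covers both.
const lockBit = uint64(1) << 63

func version(hdr uint64) uint64  { return hdr &^ lockBit }
func isLocked(hdr uint64) bool   { return hdr&lockBit != 0 }
func withLock(hdr uint64) uint64 { return hdr | lockBit }

func main() {
	hdr := uint64(7) // version 7, unlocked
	fmt.Println(version(hdr), isLocked(hdr)) // 7 false
	hdr = withLock(hdr)
	fmt.Println(version(hdr), isLocked(hdr)) // 7 true
	fmt.Println(version(hdr) + 1)            // 8: the version after a commit
}
```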
In addition, in the server's memory there are pairs of queues, message queues and logs, one pair for every other computer in the system. That means that if there are four machines running transactions, there will be four logs sitting in this server's memory that can be appended to with RDMA, one for each of the other computers that can run transactions. So the transaction code running on, say, computer 2, when it wants to talk to this server and append to its log, actually appends to "computer 2's log" in this server's memory. There's a total of N-squared of these queues floating around across the servers' memories. There's one set of logs, which are meant to be non-volatile, and also a separate set of message queues used for more RPC-like communication, again one incoming message queue per other server, written with RDMA writes.

All right, the next thing to talk about is Figure 4 in the paper, which lays out the OCC commit protocol that FaRM uses. I'm going to go through the steps mostly one by one, and to begin with I'm going to focus only on the concurrency-control part; it turns out these steps also do replication as well as implement serializable transactions, but we'll talk about the replication for fault tolerance a bit later.

The first thing that happens is the execute phase, and this is the TxReads and TxWrites, the reads and writes the client transaction is doing. The transaction runs on some client machine, and when it needs to read something, it uses a one-sided RDMA read to simply read the object out of the relevant primary's memory. In the figure there's a primary and a backup for each of three different shards, and we're imagining that our transaction reads one object from each of those shards using one-sided RDMA reads, which means each read is that blindingly fast five microseconds. So the client reads everything it needs to read for the transaction; also, anything it's going to write it first reads, because it needs to get the object's initial version number.

So that's the execute phase. Then, when the transaction calls TxCommit to indicate that it's done, the library on the client, inside that TxCommit call, acts as the transaction coordinator and runs this whole protocol, which is a kind of elaborate version of two-phase commit, described in terms of rounds of messages: the transaction coordinator sends a bunch of LOCK messages and waits for all the replies, then VALIDATE messages and waits for all those replies, and so on. The first phase in the commit protocol is the LOCK phase. In this phase, for each object the client has written, the client sends the updated object to the relevant primary, as a new log entry in that primary's log for this client. The client is really using RDMA to append to the primary's log, and what it appends is the object ID of the object it wants to write, the version number the client initially read when it read the object, and the new value. It appends such a record to the log on the primary of each shard in which it wrote an object; in this example, the transaction wrote two different objects, one on primary 1 and the other on primary 2. Now these new log records are sitting in the logs of the primaries, but each primary has to actively process them, because it needs to do a number of checks involved in validation, to decide whether its part of the transaction can be allowed to commit. So at this point we have to wait for each primary to poll this client's log in the primary's memory, notice the new log entry, process it, and then send back a yes-or-no vote saying whether it is or is not willing to do its part of the transaction.
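Here is a sketch of the coordinator's side of the LOCK phase just described. The record fields match what gets appended (OID, version read, new value); the appendAndAwaitVote helper is a hypothetical stand-in for the RDMA append plus the wait for the primary's vote, and FaRM issues the appends in parallel rather than one at a time as this sketch does.

```go
package main

import "fmt"

// lockRecord is what the coordinator appends to a primary's log for
// each object the transaction wrote.
type lockRecord struct {
	oid         uint64
	versionRead uint64 // version the client saw during the execute phase
	newValue    []byte // buffered write; applied only at COMMIT-PRIMARY
}

// lockPhase appends one LOCK record per written object and collects
// the primaries' votes. (Simplified: FaRM appends them all and then
// waits for the votes, rather than going one record at a time.)
func lockPhase(writes []lockRecord, appendAndAwaitVote func(lockRecord) bool) bool {
	for _, rec := range writes {
		if !appendAndAwaitVote(rec) {
			return false // any "no" vote aborts the whole transaction
		}
	}
	return true // all primaries voted yes: proceed to COMMIT-PRIMARY
}

func main() {
	writes := []lockRecord{{oid: 42, versionRead: 7, newValue: []byte{1}}}
	ok := lockPhase(writes, func(lockRecord) bool { return true }) // pretend every primary votes yes
	fmt.Println("all locks acquired:", ok)
}
```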
So what does a primary do when its polling loop sees an incoming LOCK log entry from a client? First of all, if the object with that object ID is currently locked, the primary rejects the LOCK message and sends back a message to the client, using RDMA, saying no: this transaction cannot proceed, I'm voting no in the two-phase commit. That will cause the transaction coordinator to abort the transaction. If the object is not locked, the next thing the primary does is check the version number: it checks that the version number the client sent, that is, the version the client originally read, is unchanged. If the version number has changed, that means that between when our transaction read the object and when it tried to write it, somebody else wrote the object, so again the primary responds no and forbids the transaction from continuing. But if the version number is the same and the lock is not set, the primary sets the lock and returns a positive response to the client.

Now, because the primary is multi-threaded, running on multiple CPUs, other CPUs may be reading incoming log queues from other clients at the same time on the same primary, so there can be races between LOCK-record processing from different transactions trying to modify the same object. The primary therefore uses an atomic instruction, a compare-and-swap, to check the version number and set the lock bit on that version number as a single atomic operation. This is the reason the lock bit has to be in the high bit of the version-number word: so that a single compare-and-swap instruction can cover both the version number and the lock bit. One thing to note is that if the object is already locked, there's no blocking, no waiting for the lock to be released: the primary simply sends back a no if some other transaction has it locked.
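Here is a sketch of the primary's handling of one LOCK record, with the version check and the lock-set folded into a single compare-and-swap on the header word, as just described. The types are simplified stand-ins for the real object layout.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const lockBit = uint64(1) << 63

type object struct {
	header uint64 // lock bit | version number
	data   []byte
}

// processLock returns the primary's vote for this object. The CAS
// succeeds only if the header is exactly (unlocked, versionRead): a set
// lock bit or a changed version number both make it fail, and in either
// case the primary votes no. There is no waiting for the lock.
func processLock(obj *object, versionRead uint64) bool {
	return atomic.CompareAndSwapUint64(&obj.header, versionRead, versionRead|lockBit)
}

func main() {
	obj := &object{header: 7, data: []byte{0}}
	fmt.Println(processLock(obj, 7)) // true: lock acquired, vote yes
	fmt.Println(processLock(obj, 7)) // false: already locked, vote no

	stale := &object{header: 8, data: []byte{1}} // someone already committed version 8
	fmt.Println(processLock(stale, 7))           // false: version changed, vote no
}
```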
Any questions about the LOCK phase of commit? All right. Back in the client, which is acting as transaction coordinator, it waits for responses from all the primaries, that is, from the primary of every shard holding an object the transaction modified. If any of them say no, if any of them reject the transaction, the transaction coordinator aborts the whole transaction and sends out messages to the primaries saying, in effect, I changed my mind, I don't want to commit this transaction after all. But if all the primaries answer yes, the transaction coordinator decides that the transaction can actually commit. The primaries, of course, don't know whether they all voted yes, so the transaction coordinator has to notify every primary: yes, indeed, everybody voted yes, so please actually commit this. The way the client does that is by appending another record to the logs of the primaries for each modified object; this time it's a COMMIT-PRIMARY record that it appends. (I'm skipping over VALIDATE and COMMIT-BACKUP for now; I'll talk about those later, so just ignore them for the moment.) So the transaction coordinator appends a COMMIT-PRIMARY record to each primary's log, and it only has to wait for the hardware RDMA acknowledgments; it doesn't have to wait for the primaries to actually process the log records. In fact, as soon as the transaction coordinator gets a single acknowledgment from any of the primaries, it can return OK equals true to the application, signifying that the transaction succeeded. Then there's another stage later on: once the transaction coordinator knows that every primary knows the transaction committed, it can tell all the primaries that they can discard the log entries for this transaction.

OK, now there's one last thing that has to happen. The primaries, which are polling their logs, will at some point notice a COMMIT-PRIMARY record. A primary that receives a COMMIT-PRIMARY log entry knows that it locked that object previously and that the object must still be locked, so what the primary does is update the object in its memory with the new contents that were previously sent in the LOCK message, increment the version number associated with the object, and finally clear the lock bit. What that means is that as soon as a primary receives and processes a COMMIT-PRIMARY log record, since it clears the lock bit and updates the data, it may well expose the new data to other transactions: other transactions from this point on are free to use the object, with its new value and new version number.
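And here is the matching sketch of COMMIT-PRIMARY processing at the primary: install the new value, bump the version, clear the lock bit. Again the types are simplified stand-ins.

```go
package main

import "fmt"

const lockBit = uint64(1) << 63

type object struct {
	header uint64 // lock bit | version number
	data   []byte
}

// processCommitPrimary installs the new value (sent earlier in the LOCK
// record), increments the version number, and clears the lock bit. From
// this point other transactions can see the object's new value and version.
func processCommitPrimary(obj *object, newValue []byte) {
	obj.data = newValue
	oldVersion := obj.header &^ lockBit // the object is still locked from the LOCK phase
	obj.header = oldVersion + 1         // new version number, lock bit cleared
}

func main() {
	obj := &object{header: 7 | lockBit, data: []byte{0}} // locked at version 7
	processCommitPrimary(obj, []byte{1})
	fmt.Println(obj.header, obj.data) // 8 [1]: unlocked, version 8, new value
}
```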
All right, I'm going to do an example. Any questions about the machinery before I start? Feel free to ask questions at any time.

So, an example. Suppose we have two transactions, T1 and T2, and they're both trying to do the same thing: they both just want to increment x, where x is an object sitting in some server's memory. Before we look at what actually happens, we should remind ourselves what the valid possible outcomes are, and that's all about serializability. FaRM guarantees serializability, which means that whatever FaRM actually does has to be equivalent to some one-at-a-time execution of these two transactions. So we're allowed to see the results you would get if T1 ran and then, strictly afterwards, T2 ran, or the results that could arise if T2 ran and then T1 ran; those are the only possibilities. Now, FaRM is also entitled to abort a transaction, so we additionally have to consider the possibility that one of the two transactions aborted, or indeed that both aborted. Since they're both doing the same thing, there's a certain amount of symmetry here. One possibility is that both committed, which means two increments happened, so one legal outcome is x = 2; and TxCommit's return value has to agree with whether things aborted or committed, so in that case both transactions need to see TxCommit return true. Another possibility is that only one of the transactions committed and the other aborted, in which case we want to see x = 1, with one TxCommit returning true and the other false. And a final possibility is that they both aborted; we don't think that needs to happen here, but it's legal, in which case x is unchanged and both get false back from TxCommit. So we had better not see anything other than these three options.

All right. What happens depends, of course, on the timing, so I'm going to enumerate various ways the commit protocol could interleave; for convenience I have a handy reminder of the actual commit protocol here. One possibility is that the two transactions run exactly in lockstep: they send all their messages at the same time, and they read at the same time. I'm going to assume x starts out as zero, so if they both read at the same time they both see zero. Then they both send out LOCK messages at the same time, accompanying them with the value 1, since each is adding 1 to x, and if the LOCK responses say yes, they would both commit at the same time. So if this is the scenario, what's going to happen, and why? Would anyone like to raise their hand and hazard a guess?

Well, that's right: the reads can't possibly fail, since they're one-sided reads. Both transactions are going to send essentially identical LOCK messages to whichever primary holds object x, both with the same version number they read and the same new value, so the primary is going to see two LOCK messages in two different incoming logs, assuming the clients are running on different machines. Exactly what happens now is left slightly to our imagination by the paper, but I think the two incoming log messages could be processed in parallel on different cores of the primary. The critical instruction on the primary is the atomic compare-and-swap. Exactly as somebody volunteered: one of the cores will execute the compare-and-swap first, will set the lock bit on that object's version, and will observe that the lock bit wasn't previously set; whichever core executes the atomic compare-and-swap second will observe that the lock is already set. So one of the two gets a yes and the other fails: it sees the lock already set and the primary answers no. For the sake of symmetry I'll just imagine that it's transaction 2 the primary says no to, so transaction 2's client code will abort; transaction 1 got the lock, got a yes back, and will actually commit.
65:29 Transaction 1 got the lock and got a yes back, so it will actually commit: when the primary gets the COMMIT-PRIMARY message, it installs the updated 65:40 object, clears the lock bit, increments the version number, and transaction 1's TX_commit() returns true. 65:51 Because the primary sent back a no to transaction 2, its TX_commit() is going to return false, and the final value is x = 1. 66:00 That was one of our allowed outcomes, but of course it's not the only possible interleaving. Any questions about how this played out, or why it executed the way it did? 66:19 OK, so there are other possible interleavings. How about this one: let's imagine that transaction 2 does its read first — 66:33 it doesn't really matter whether the reads are concurrent or not — then transaction 1 does its read, and transaction 1 is a little bit faster: it gets its LOCK message in, gets a reply, and gets its commit back, and only afterwards does transaction 2 get going again and send its LOCK message to see if it can commit. 66:58 So what happens this time? 67:16 Well, transaction 1's LOCK is going to succeed, because there's no reason for the lock bit to be set — the second LOCK message hasn't even been sent yet. The LOCK message sets the lock bit, and the COMMIT-PRIMARY message then clears it, so the lock bit will be clear by the time transaction 2 inserts its LOCK entry in the primary's log; 67:49 the primary won't see the lock bit set at that point. 67:54 But, as somebody volunteered, what the primary will see is the version number: the LOCK message contains the version number that transaction 2 originally read, and since COMMIT-PRIMARY increments the version number, 68:11 the primary is going to see that the version number is stale — the number on the real object is now higher — so it's going to send back a no to the coordinator, and the coordinator is going to abort this transaction. 68:26 Again we get x = 1, one of the transactions returning true and the other false, which is the same final outcome as before, and it is allowed. Any questions about how this played out? 68:44 A slightly different scenario I was going to consider is one in which the commit message is stalled and arrives after the second LOCK; that's essentially the same as the first scenario, in which one transaction got the lock set and the other observed the lock. 69:10 OK, one last scenario: let's suppose we see this, where transaction 1 runs completely — read, LOCK, COMMIT — before transaction 2 even gets going. What's going to happen this time? 69:47 Yeah, somebody has the right answer: of course the first transaction goes through, because there's no contention at all, and when the second transaction goes to read x it sees the new version number, as incremented by the COMMIT-PRIMARY processing on the primary, and the lock bit won't be set. 70:11 So when it sends its LOCK log entry to the primary, the lock-processing code on the primary sees the lock isn't set and the version is the latest, so it says it's willing to commit. 70:26 The outcome this time is x = 2, because that second read saw not only the new version number but also the new value, which was 1, so the second increment produces 2, and both calls to TX_commit() return true. That's right: both succeed, with x = 2, which is also one of the allowed outcomes.
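Pulling those interleavings together, here is a toy, single-machine sketch of the execute / LOCK / COMMIT-PRIMARY flow for the x = x + 1 transaction. The type and function names are invented, the "primary" is just a local struct, and the one-sided RDMA read and the log records are reduced to ordinary memory accesses, so treat it only as a model of the decision logic, not as FaRM's implementation.

```go
package main

import "fmt"

// Toy object header: version number, lock flag, and the stored value.
// Real FaRM packs the lock bit and version into one word and keeps the
// object in RDMA-readable memory; this is just a local model.
type object struct {
	version uint64
	locked  bool
	value   int
}

// runIncrement models one client/coordinator running "x = x + 1".
// (Here the lock check-and-set is plain code; on the real primary it is
// the atomic compare-and-swap sketched earlier.)
func runIncrement(x *object) bool {
	// Execute phase: a one-sided read records the value and the version.
	readVersion, v := x.version, x.value

	// LOCK phase: the primary says yes only if the object is unlocked
	// and its version is unchanged since the read.
	if x.locked || x.version != readVersion {
		return false // primary replies no; the coordinator aborts
	}
	x.locked = true

	// COMMIT-PRIMARY phase: install the new value, bump the version,
	// clear the lock bit.
	x.value = v + 1
	x.version++
	x.locked = false
	return true // TX_commit() returns true
}

func main() {
	x := &object{version: 1}
	// Run the two transactions one after the other, as in the last
	// scenario: no conflict, so both commit and x ends up as 2.
	fmt.Println(runIncrement(x), runIncrement(x), x.value) // true true 2
}
```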
70:51 All right. So it happened to work out in these cases, and the intuition behind why optimistic concurrency control provides serializability — 71:03 why it basically checks that the execution that did happen is the same as some one-at-a-time execution — is essentially this: 71:13 if there was no conflicting transaction, the version numbers and the lock bits won't have changed; if nobody else is messing with these objects, we'll see the same version numbers at the end of the transaction as we did when we first read the objects. 71:27 Whereas if there is a conflicting transaction between when we read an object and when we try to commit a change, and that conflicting transaction modified something and actually started to commit, we will see a new version number or a lock bit set. 71:47 So the comparison of the version numbers and lock bits, between when you first read the object and when you finally commit, tells you whether some other commit to those objects snuck in while you were using them. 72:02 And the cool thing to remember is that this optimistic scheme — in which we don't actually check or take locks when we first use the data — is what allowed us to use these extremely fast one-sided RDMA reads to read the data and get high performance. 72:24 OK. The way I've explained it so far, without VALIDATE and without COMMIT-BACKUP, is basically the way the system works; but VALIDATE is an optimization for objects that are only read and not written, and COMMIT-BACKUP is part of the scheme for fault tolerance. In the few minutes we have left I want to talk about VALIDATE. 72:52 The VALIDATE stage is an optimization for objects that were only read by the transaction and not written, and it's particularly interesting for a pure read-only transaction that modifies nothing. 73:05 The optimization is that the transaction coordinator can execute the VALIDATE with a one-sided read, which is extremely fast, rather than having to append something to a log and wait for the primary to see the log entry and think about it. 73:22 So VALIDATE's one-sided read is going to be much, much faster; it essentially replaces LOCK for objects that were only read. 73:35 Basically, what VALIDATE does is have the transaction coordinator re-fetch the object's header. The coordinator read the object in the execute phase; when it's committing, instead of sending a LOCK message it re-fetches the object header and checks whether the version number now is the same as the version number when it first read the object, and also checks that the lock bit is clear. 74:10 So that's how it works: instead of sending a LOCK message, it sends this VALIDATE — really just another one-sided read — which should be much faster for a read-only object.
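Here is a minimal sketch of that check, again with invented names and no RDMA: a VALIDATE succeeds only if the re-fetched header still carries the version recorded at the execute-phase read and the lock bit is clear.

```go
package main

import "fmt"

// Toy header: just a version number and a lock flag.
type header struct {
	version uint64
	locked  bool
}

// validate sketches the VALIDATE check for an object the transaction only
// read: re-fetch the header (in FaRM this is a one-sided RDMA read) and
// confirm nothing has changed since the execute-phase read.
func validate(current header, versionReadEarlier uint64) bool {
	return !current.locked && current.version == versionReadEarlier
}

func main() {
	h := header{version: 5}
	fmt.Println(validate(h, 5)) // true: no conflicting writer, keep going
	h.version = 6               // some other transaction committed an update
	fmt.Println(validate(h, 5)) // false: the transaction must abort
}
```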
74:17 So let me put up another transaction example and run through how it works. 74:23 Suppose x and y are initially 0, and we have two transactions: T1 says "if x is equal to zero, set y equal to 1", and T2 says "if y is zero, set x equal to 1". 74:47 This is an absolutely classic test for strong consistency. If the execution is serializable, it has to be equivalent to either T1 then T2 or T2 then T1 — any correct implementation has to get the same results as running them one at a time. 75:10 If you run T1 and then T2, you get y = 1 and x = 0, because by the time the second if statement runs, y is already 1, so it does nothing; symmetrically, T2 then T1 gives you x = 1 and y = 0. 75:31 And it turns out that if they both abort you can get x = 0, y = 0. But what you are absolutely not allowed to get is x = 1 and y = 1; that's not allowed. 75:48 OK, so I'm going to use this as a test to see what happens with VALIDATE, and again we're going to suppose these two transactions execute at absolutely the same time — the most obvious case, and the hardest one. 76:17 So we have T1's read of x and T2's read of y. T1 locks y, because it wrote y, and T2 locks x; but since we're now using this read-only validation optimization, T2 has to validate y, and T1 has to validate x — it read x but didn't write it, so it validates it, which is much quicker — and then maybe each of them is going to commit. 76:50 So the question is: if we use VALIDATE as I described it — it just checks that the version number hasn't changed and the lock bit isn't set — will we get a correct answer? 77:22 And no — actually, the validation is going to fail for both. When these LOCK messages were processed by the relevant primaries, they caused the lock bits to be set (initially, presumably, the lock bits were clear). 77:42 So when each client comes to validate — even though it's doing a one-sided read of the object header for x or y — it's going to see the lock bit that was set by the processing of the other transaction's LOCK request. 77:55 So they're both going to see the lock bit set on the object they merely read, they're both going to abort, and neither x nor y will be modified — and that was one of the legal outcomes. That's right, somebody noticed this: indeed, both validates will fail.
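Here is a deterministic toy replay of that lockstep case, with made-up types and no real concurrency or RDMA, just to show why both validates come back false: both LOCK records are processed before either VALIDATE, so each transaction sees the lock bit the other one just set.

```go
package main

import "fmt"

// Toy object header: version number plus lock bit.
type object struct {
	version uint64
	locked  bool
}

// Lockstep interleaving for
//   T1: if x == 0 { y = 1 }    T2: if y == 0 { x = 1 }
func main() {
	x := &object{}
	y := &object{}

	// Execute phase: both transactions read, recording version numbers.
	xVer, yVer := x.version, y.version

	// LOCK phase: T1 locks y (its write set), T2 locks x.
	y.locked = true // primary for y answers yes to T1
	x.locked = true // primary for x answers yes to T2

	// VALIDATE phase: each re-reads the header of the object it only read.
	t1OK := !x.locked && x.version == xVer // T1 validates x
	t2OK := !y.locked && y.version == yVer // T2 validates y

	fmt.Println(t1OK, t2OK) // false false: both abort, neither x nor y is written
}
```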
78:16 Of course, sometimes a transaction can go through, and here's a scenario in which it does work out: transaction 1 is a little faster and validates first. 78:45 So what's going to happen? Transaction 1 is a little bit faster, so this time its validate is going to 79:05 succeed, because nothing has happened to x between when transaction 1 read it and when it validated. Presumably its LOCK also went through without any trouble, because nobody has modified y either, so the primary answered yes; the one-sided validate read revealed an unchanged 79:21 version number and a clear lock bit, so transaction 1 can commit, and y becomes 1. 79:29 But by this point — if this is the order — when the primary processes transaction 2's LOCK of x, that will also go through with no problem, because nobody has modified x. 79:47 When the client running transaction 2 re-fetches the version number and lock bit for y, though, what it sees depends on whether the commit has happened yet: 80:02 if the commit hasn't happened, this validate will see that the lock bit is set, because it was set back by transaction 1's LOCK; if the commit has happened already, the lock bit will be clear, but the validate's one-sided read will see a different version number than was originally read. 80:17 Either way — as somebody just answered — transaction 1 will commit and transaction 2 will abort. 80:25 And although I don't have time to talk about it here, for a straight read-only transaction there doesn't need to be a locking phase and there doesn't need to be a commit phase: pure read-only transactions can be done with just one-sided RDMA reads for the reads and one-sided RDMA reads for the validates, so they're extremely fast and don't require any work or attention from the server. 80:54 So that's at the heart of it: everything about FaRM is very streamlined, partially due to RDMA, and it uses OCC because it's basically forced to, in order to be able to do reads without checking locks. 81:15 There are a few downsides, though. It turns out optimistic concurrency control really works best if there are relatively few conflicts; if there are conflicts all the time, transactions will have to abort. And there are a bunch of other restrictions I already mentioned, like the fact that the data must all fit in RAM and all the computers must be in the same data center. 81:35 Nevertheless, this was viewed at the time — and still is — as a very surprisingly high-speed implementation of distributed transactions, just much faster than any system in production use. 81:52 It's true that the hardware involved is a little bit exotic: it really depends on this non-volatile RAM scheme and on these special RDMA NICs, and those are not particularly pervasive now, but you can get them, and with performance like this it seems likely that both NVRAM and RDMA will eventually be pretty pervasive in data centers, so that people can play these kinds of games. 82:16 And that's all I have to say about FaRM. I'm happy to take any questions if anybody has some; if not, I'll see you next week with Spark, which — you may be happy to know — is absolutely not about transactions. All right, everyone, bye-bye.