Transcript

Maybe we should get started. It's been a long time since we've all been in the same place; I hope everybody's doing well. Today I'd like to talk about Spanner. The reason to talk about this paper is that it's a rare example of a system that provides distributed transactions over data that's widely separated, that is, data that might be scattered over different data centers all over the Internet. That's almost never done in production systems. Of course it's extremely desirable to have transactions, since programmers really like them, and it's also extremely desirable to have data spread all over the network, both for fault tolerance and to ensure that there's a copy of the data near everybody who wants to use it.

On the way to achieving this, Spanner uses at least two neat ideas. One is that they run two-phase commit, but they run it over Paxos-replicated participants, in order to avoid the problem in two-phase commit that a crashed coordinator can block everyone. The other interesting idea is that they use synchronized time in order to have very efficient read-only transactions. The system has been very successful: it's used by many different services inside Google, it's been turned by Google into a product for their cloud customers, and it has inspired a bunch of other research and other systems, both by showing that wide-area transactions are possible and, more specifically, because there's at least one open-source system, CockroachDB, that explicitly uses a lot of the design.

The motivating use case, the reason the paper says they first started designing Spanner, was that they already had many big database systems at Google, but in their advertising system in particular the data was sharded over many distinct MySQL and BigTable databases, and maintaining that sharding was an awkward, manual, and time-consuming process. In addition, their previous advertising database didn't allow transactions that spanned more than a single server. They really wanted to be able to spread their data out more widely for better performance, and to have transactions over the multiple shards of the data.

For their advertising database, the workload was apparently dominated by read-only transactions. You can see this in Table 6, where there are billions of read-only transactions and only millions of read-write transactions, so they were very interested in the performance of transactions that only do reads. Apparently they also required strong consistency for transactions: they wanted serializable transactions, and they also wanted external consistency, which means that if one transaction commits, and then after it finishes committing another transaction starts, the second transaction needs to see any modifications done by the first. This external consistency turns out to be interesting with replicated data.
All right, let me describe the basic physical arrangement of servers that Spanner uses. Its servers are spread over data centers, presumably all over the world, certainly all over the United States, and each piece of data is replicated at multiple data centers. So the picture has multiple data centers; let's say there are three, though really there would be many more.

Within each data center the data is sharded: you can think of it as being broken up by key and split over many servers. Maybe there's one server that serves keys starting with A in this data center, another serving keys starting with B, and so forth; lots of sharding over lots of servers. Every shard is replicated at more than one data center, so there's going to be another replica of the A keys and the B keys and so on in the second data center, and yet another, hopefully identical, copy of all this data at the third data center. In addition, each data center has multiple clients of Spanner, and what these clients really are is web servers: if an ordinary human being sitting in front of a web browser connects to some Google service that uses Spanner, they'll connect to some web server in one of the data centers, and that web server acts as one of these Spanner clients.

The replication is managed by Paxos, in fact a variant of Paxos that has leaders and is really very much like the Raft we're all familiar with. Each Paxos instance manages all the replicas of a given shard of the data: all the copies of one shard form one Paxos group, all the replicas of another shard form another Paxos group, and each of these Paxos instances is independent, has its own leader, and runs its own instance of the Paxos protocol. The reason for the sharding, and for the independent Paxos instance per shard, is parallel speed-up and a lot of parallel throughput: there's a vast number of clients working on behalf of web browsers, so there's typically a huge number of concurrent requests, and it pays immensely to split them up over multiple shards and multiple Paxos groups running in parallel.

Each of these Paxos groups has a leader, a lot like Raft. Maybe the leader for one shard is a replica in data center 1, and the leader for another shard is a replica in data center 2, and so forth. That means that if a client needs to do a write, it has to send that write to the leader of the Paxos group for the shard whose data it needs to write. Just as with Raft, what these Paxos instances are really doing is replicating a log: the leader replicates a log of operations, which for this data is going to be reads and writes, to all the followers, and the followers execute that log, all in the same order.
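To make that layout a little more concrete, here is a minimal sketch in Go of how a client-side library might route a key to the Paxos group responsible for its shard. All the names here (PaxosGroup, route, the data center addresses) are hypothetical, invented for illustration; the lecture and paper don't describe a client API at this level.

```go
package main

import "fmt"

// PaxosGroup stands in for one shard's replica set: a few servers in
// different data centers, one of which is the current leader.
type PaxosGroup struct {
	Shard    string   // e.g. "keys starting with A"
	Replicas []string // one replica address per data center
	Leader   string   // writes must be sent to this replica
}

// route maps a key to its shard's Paxos group by first byte, a
// stand-in for whatever placement scheme the real system uses.
func route(key string, groups map[byte]PaxosGroup) PaxosGroup {
	return groups[key[0]]
}

func main() {
	groups := map[byte]PaxosGroup{
		'a': {Shard: "a*", Replicas: []string{"dc1:a", "dc2:a", "dc3:a"}, Leader: "dc2:a"},
		'b': {Shard: "b*", Replicas: []string{"dc1:b", "dc2:b", "dc3:b"}, Leader: "dc1:b"},
	}
	g := route("apple", groups)
	// A write to "apple" has to go to g.Leader; a read-only
	// transaction may be able to use whichever replica is local.
	fmt.Println(g.Leader)
}
```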
So, the reasons for this setup. The sharding, as I mentioned, is for throughput. The multiple copies in different data centers are there for two reasons. One is that you want copies in different data centers in case one data center fails: maybe power fails to the entire city the data center is in, or there's an earthquake or a fire, and you'd like other copies at other data centers that are probably not going to fail at the same time. There's a price to pay for that, because now the Paxos protocol may have to talk over long distances to followers in different data centers. The other reason to have data in multiple data centers is that it may allow you to have copies of the data near all the different clients that use it. If a piece of data may be read in both California and New York, it's nice to have one copy in California and one in New York so that reads can be very fast, and indeed a lot of the focus of the paper is making reads from the nearest replica both fast and correct. Finally, another interesting interaction between Paxos and multiple data centers is that Paxos, like Raft, only requires a majority in order to replicate a log entry and proceed, so if one data center is slow or distant or flaky, the Paxos system can keep chugging along and accepting new requests.

With this arrangement there are a couple of big challenges the paper has to bite off. One is that they really want to do reads from local data centers, but because Paxos only requires each log entry to be replicated on a majority, a minority of the replicas may be lagging and may not have seen the latest data committed by Paxos. That means that if we allow clients to read from local replicas for speed, they may be reading out-of-date data if their replica happens to be in the minority that didn't see the latest updates. Since they require external consistency, so that every read sees the most up-to-date data, they have to have some way of dealing with the possibility that local replicas may be lagging. The other issue is that a transaction may involve multiple shards and therefore multiple Paxos groups: a single transaction may read or write multiple records that are stored in multiple shards, so we need distributed transactions.

I'm going to explain how the transactions work; that's going to be the focus of the lecture. Spanner actually implements read-write transactions quite differently from read-only transactions, so let me start with read-write transactions, which have a much more conventional design.
First, read-write transactions. Let me remind you what a transaction looks like; let's choose a simple one that mimics a bank transfer. On one of those client machines, a client of Spanner, you'd run some transaction code. The code says: begin a transaction; then read and write some records, say increment a bank balance stored in database record x and decrement y's bank balance; then end the transaction, and the client hopes the database will go off and commit it.

I want to trace through all the steps that have to happen for Spanner to execute this read-write transaction. First of all, there's a client in one of the data centers that's driving the transaction. Let's imagine that x and y are on different shards, since that's the interesting case, and that each of the two shards is replicated in three different data centers. So at each data center there's a server holding a replica of the shard with x's balance and a server holding a replica of the shard with y's.

Spanner uses two-phase commit and two-phase locking almost exactly as described in last week's reading from the 6.033 textbook. The huge difference is that instead of the participants and the transaction coordinator being individual computers, the participants and the coordinator are Paxos-replicated groups of servers, for increased fault tolerance. The three replicas of the shard that stores x are really a Paxos group, and the same goes for the three replicas storing y, and for each group one of the three servers is the leader. Let's say the server in data center 2 is the Paxos leader for x's shard, and the server in data center 1 is the Paxos leader for y's shard.

The first thing that happens is that the client picks a unique transaction ID, which is carried on all of these messages so that the system knows which operations belong to which transaction. Despite the way the code looks, where it reads and writes x and then reads and writes y, the transaction code has to be organized so that it does all its reads first and then, at the very end, does all the writes essentially as part of the commit. To do its reads the client has to acquire locks: just as in last week's 6.033 reading, every time you read or write a data item, the server responsible for it has to associate a lock with that item, and the read locks in Spanner are maintained only at the Paxos leader.
So when the client transaction wants to read x, it sends a read-x request to the leader of x's shard, and that leader returns the current value of x and sets a lock on x. Of course, if the lock is already set, the leader won't respond to the client until whatever transaction currently has the data locked releases the lock by committing; then the leader sends the value of x back to the client. When the client needs to read y it gets lucky this time: assuming the client is in data center 1, y's leader is in the local data center, so this read is going to be a lot faster. That read sets the lock on y at the Paxos leader and returns y's value.

Now the client has done all its reads; it does its internal computations and figures out the writes it wants to do, the values it wants to write to x and y. It sends out the updated values for the records it wants to write all at once, toward the end of the transaction. The first thing it does is choose one of the Paxos groups to act as the transaction coordinator; it chooses this in advance and sends out the identity of the chosen coordinator group. Let's assume it chooses y's Paxos group; I've drawn a double box to say that this server is not only the leader of its Paxos group but also acting as the transaction coordinator for this transaction.

Then the client sends out the updated values it wants written. It sends a write-x request to x's leader with the new value and the identity of the transaction coordinator. When the Paxos leader for each written value receives the write request, it sends a prepare message to its followers and gets that record into its Paxos log. By logging the prepare it is promising to be able to carry out this transaction, for example promising that it hasn't crashed and lost its locks. When it gets responses from a majority of its followers, that Paxos leader sends a yes vote to the transaction coordinator, saying: yes, I promise to be able to carry out my part of the transaction. Meanwhile the client also sends the value to be written to y to y's Paxos leader, and that server, acting as Paxos leader, sends prepare messages to its followers, logs the prepare through Paxos, waits for acknowledgments from a majority, and then, you can think of it as the Paxos leader sending the transaction coordinator, which is on the same machine and maybe in the same program, a yes vote saying: yes, I can commit.

When the transaction coordinator gets responses from the leaders of all the shards whose data is involved in this transaction, and they all said yes, the transaction coordinator can commit; otherwise it can't. Let's assume it decides to commit.
At that point the transaction coordinator sends out to its own Paxos followers a commit record, saying: please remember permanently in the log that we're committing this transaction. It also tells the leaders of the other Paxos groups involved in the transaction that they can commit as well, and those leaders then send commit messages to their followers. The transaction coordinator probably doesn't send the commit message to the other shards until the commit is safe in its own log, so that the coordinator is guaranteed not to forget its decision. Once these commit messages are committed into the Paxos logs of the different shards, each shard can actually execute the writes, that is, install the written data and release the locks on the data items so that other transactions can use them, and then the transaction is over. Please feel free to ask questions by raising your hand.

There are some points to observe about the design so far, which only covers the read-write side of transactions. One is that it's the locking that ensures serializability: if two transactions conflict because they use the same data, one has to completely wait for the other to release its locks before it can proceed. So Spanner is using completely standard two-phase locking to get serializability, and completely standard two-phase commit to get distributed transactions. Two-phase commit is widely hated, because if the transaction coordinator fails or becomes unreachable, any transactions it was managing block indefinitely, with locks held, until the coordinator comes back up; people have generally been very reluctant to use two-phase commit in the real world because it's blocking. Spanner solves this by replicating the transaction coordinator: the coordinator itself is a Paxos-replicated state machine, so everything it does, for example recording whether it has committed, is replicated into the Paxos log. If the leader that was managing the transaction fails, either of the other two replicas can spring to life, take over leadership, and also take over being the transaction coordinator; if the coordinator decided to commit, the new leader will see the commit record in its log and can immediately tell the other two-phase-commit participants that the transaction committed. This effectively eliminates the problem of two-phase commit blocking with locks held after a failure, which is a really big deal, because that problem otherwise makes two-phase commit basically unacceptable for any large-scale system with a lot of parts that might fail.
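To recap the whole commit path in one place, here is a minimal sketch in Go of two-phase commit where both the coordinator and the participants are Paxos groups rather than single machines. The type and function names (PaxosGroup, Participant, Commit) are all hypothetical, and the real protocol carries timestamps, a lock table, and error handling that are omitted here; this is only meant to show where the prepare and commit records get logged.

```go
package twopc

// Write is one pending update held under a lock by a participant.
type Write struct {
	Key, Value string
}

// PaxosGroup hides the replication: Append only returns once the record
// is committed on a majority, so it survives a leader failure.
type PaxosGroup interface {
	Append(record string) // replicate a log record via Paxos
	Apply(w []Write)      // install values, release this txn's locks
}

// Participant is the leader of one shard's Paxos group.
type Participant struct {
	Group  PaxosGroup
	Writes []Write
}

// Prepare logs a prepare record: a durable promise that this group
// holds the locks and can carry out its part of the transaction.
func (p *Participant) Prepare(txnID string) bool {
	p.Group.Append("PREPARE " + txnID)
	return true // vote yes; a real participant could vote no, e.g. after a crash
}

// Commit is run by the coordinator group's leader.
func Commit(txnID string, coord PaxosGroup, parts []*Participant) bool {
	// Phase 1: every participant must durably promise to commit.
	for _, p := range parts {
		if !p.Prepare(txnID) {
			coord.Append("ABORT " + txnID)
			return false
		}
	}
	// The decision itself is Paxos-replicated, so a coordinator leader
	// crash cannot leave the outcome unknown; this is what removes the
	// classic two-phase-commit blocking problem.
	coord.Append("COMMIT " + txnID)
	// Phase 2: each participant logs the outcome, applies its writes,
	// and releases the locks.
	for _, p := range parts {
		p.Group.Append("COMMIT " + txnID)
		p.Group.Apply(p.Writes)
	}
	return true
}
```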
Another thing to note is that there's a huge number of messages in this diagram, and many of them cross data centers. Some of the messages that go between the shards, or between a client and a shard whose leader is in another data center, may take many milliseconds, and in a world in which computations take nanoseconds this is a pretty grim expense. Indeed you can see that in Table 6, which describes the performance of a Spanner deployment where the different replicas are on different sides of the United States, east coast and west coast: it takes about a hundred milliseconds to complete a transaction whose replicas are on different coasts. That's a huge amount of time, a tenth of a second. It's maybe not quite as bad as it seems, because the system is sharded and can run a lot of non-conflicting transactions in parallel, so the throughput may be very high, but the delay for individual transactions is very significant. A hundred milliseconds is maybe somewhat less than a human will notice, but if you have to do a couple of them just to generate a web page or carry out a human instruction, it starts to be a noticeable, bothersome amount of time. On the other hand, I suspect that for many uses of Spanner all the replicas are in the same city, sort of across town from each other, and then the much faster times in Table 3 are relevant: Table 3 shows that transactions complete in about 14 milliseconds instead of 100 when the data centers are nearby.

Nevertheless, these read-write transactions are slow enough that we'd like to avoid the expense if we possibly can, and that takes us to read-only transactions. It turns out that if you know in advance that all of the operations in a transaction are guaranteed to be reads, then Spanner has a much faster, much more streamlined, much less message-intensive scheme for executing them.

So, read-only transactions; this starts a new topic. The way read-only transactions work, although they rely on some information from read-write transactions, is a design quite different from the read-write transactions. Spanner's read-only transaction design eliminates two big costs that were present in read-write transactions. First of all, as I mentioned, it reads from local replicas: as long as there's a replica of the data the transaction needs in the local data center, it can read from that local replica, which may take a small fraction of a millisecond to talk to instead of maybe dozens of milliseconds to go cross-country. But again, a danger here is that any given replica may not be up to date, so there has to be a story for that.
The other big savings in the read-only design is that it doesn't use locks or two-phase commit, so it doesn't need a transaction coordinator, and that avoids things like inter-data-center messages to Paxos leaders. Because no locks are taken out, not only are the read-only transactions themselves faster, but they also avoid slowing down read-write transactions, which don't have to wait for locks held by read-only transactions. Just to preview why this is important: Tables 3 and 6 show about a ten-times latency improvement for read-only transactions compared to read-write transactions. So the read-only design is a factor-of-ten boost in latency, with much less complexity, and almost certainly far more throughput as well. The big challenge is going to be that read-only transactions don't do a lot of the things that were required for read-write transactions to get serializability, so we need to find a way to square this increased efficiency with correctness.

There are really two main correctness constraints they wanted read-only transactions to obey. The first is that, like all transactions, they still need to be serializable. Just to review, that means that even though the system may execute transactions concurrently, in parallel, the results that a bunch of concurrent transactions yield, both the values they return to clients and their modifications to the database, must be the same as some one-at-a-time, serial execution of those transactions. For a read-only transaction, what that essentially means is that all of its reads must effectively fit neatly between the writes of the transactions we view as coming before it, and must not see any of the writes of the transactions we view as coming after it. So we need a way to fit all the reads of a read-only transaction neatly between read-write transactions.

The other big constraint the paper talks about is external consistency, which is actually equivalent to the linearizability we've seen before. What it really means is that if one transaction finishes committing, and another transaction starts after the first transaction completed in real time, then the second transaction is required to see the writes done by the first. Another way of putting that is that transactions, even read-only transactions, should not see stale data: if there's a committed write from a transaction that completed before the start of the read-only transaction, the read-only transaction is required to see that write. Neither of these is particularly surprising; standard databases like MySQL, for example, can be configured to provide this kind of consistency. In a way, if you didn't know better, this is exactly the consistency you would expect of a straightforward system.
It also makes programmers' lives much easier: it makes it much easier to produce correct answers, because if you don't have this kind of consistency, the programmers are responsible for programming around whatever anomalies the database may produce. So this is sort of the gold standard of correctness.

Okay, so I want to talk about how read-only transactions work. It's a bit of a complex story, so I think what I'd like to do first is consider what would happen if we did the absolutely simplest thing and had read-only transactions do nothing special to achieve consistency, but just read the very latest copy of the data. Every time a read-only transaction does a read, we could just have it look at the local replicas and find the current, most up-to-date copy of the data. That would be very straightforward and very low overhead, so we need to understand why it doesn't work. So: why not just read the latest value? Let's imagine the read-only transaction simply reads x and y and prints them.

I want to show you an example of a situation in which having this transaction simply read the latest values yields incorrect, non-serializable results. Suppose we have three transactions running: T1, T2, T3. T3 is going to be our read-only transaction; T1 and T2 are read-write transactions. Let's say that T1 writes x, writes y, and then commits; maybe it's a bank transfer operation, transferring money from x to y, and T3 prints x and y because we're doing an audit of the bank to make sure it hasn't lost money. Transaction T2 also does another transfer between balances x and y and then commits. Now our transaction T3 needs to read x and y. The way I'm drawing these diagrams, real time, wall-clock time, the time you'd see on your watch, moves to the right. Let's say T3's read of x happens after T1 completes and before T2 starts, and that T3 is running on a slow computer, so it only manages to issue its read of y much later, after T2 has committed. The way this is going to play out is that T3 will see the x value that T1 wrote, but the y value that T2 wrote, assuming it uses this dubious procedure of simply reading the latest value in the database.

This is not serializable. We know that any equivalent serial order must have T1 followed by T2, so there are only two places T3 could go. T3 can't fit between T1 and T2, because if T3 were second in the equivalent serial order, it shouldn't see writes by T2, which comes after it: it should see the value of y produced by T1, but it doesn't, it sees the value produced by T2.
The only other place available to T3 is after T2. That serial order would produce the same value for y that T3 actually saw, but if that were the order, T3 should also have seen the value of x written by T2, and it actually saw the value written by T1. So this execution is not equivalent to any one-at-a-time serial order: there's something broken about simply reading the latest value, and we know that approach doesn't work. What we're really looking for, of course, is that our read-only transaction either reads both values as of the earlier point in time, or reads both values as of the later point in time.

The approach Spanner takes is somewhat complex. The first big idea is an existing one: it's called snapshot isolation. The way I'm going to describe it, let's imagine that all the computers involved have synchronized clocks, that is, they all have a clock that yields wall-clock time, like "it's 1:43 in the afternoon on April 7th, 2020". Assume, even though it isn't true, that all the computers involved have synchronized clocks. Furthermore, let's imagine that every transaction is assigned a particular timestamp, and the timestamps are wall-clock times taken from these synchronized clocks. For a read-write transaction, its timestamp is, in this simplified design, the real time at which the transaction coordinator starts the commit; for a read-only transaction, the timestamp is equal to its start time. Our snapshot isolation system is designed to execute so as to get the same results as if all the transactions had executed one at a time in timestamp order: we assign each transaction a timestamp, and then we arrange the execution so that the transactions get the results they would have gotten had they executed in that order. Given the timestamps, we need an implementation that easily honors them and basically shows each transaction the data as it existed at its timestamp.

The way this works for read-only transactions is that each replica, when it stores data, actually keeps multiple versions: we have a multi-version database. If a record has been written a couple of times, there's a separate copy of that record for each time it was written, each tagged with the timestamp of the transaction that wrote it. The basic strategy is that a read-only transaction allocates itself a timestamp when it starts, so it accompanies each read request with that timestamp, and whichever server stores the replica of the data the transaction needs looks into its multi-version database and finds the version of the requested record with the highest timestamp that is still no greater than the timestamp specified by the read-only transaction. That means the read-only transaction sees the data as of its chosen timestamp.
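Here is a minimal sketch, in Go, of that multi-version storage idea: each key maps to a list of timestamped versions, and a read at timestamp T returns the version with the largest timestamp that is at or below T. The names (Store, Version, Put, ReadAt) are invented for illustration; a real implementation would also garbage-collect old versions and apply the "safe time" check that comes up later in the lecture.

```go
package mvcc

// Version is one timestamped copy of a record.
type Version struct {
	Timestamp int64 // commit timestamp of the writing transaction
	Value     string
}

// Store keeps every version of every key rather than overwriting.
type Store struct {
	versions map[string][]Version // per key, roughly in write order
}

func NewStore() *Store {
	return &Store{versions: make(map[string][]Version)}
}

// Put appends a new version instead of replacing the old one.
func (s *Store) Put(key string, ts int64, value string) {
	s.versions[key] = append(s.versions[key], Version{Timestamp: ts, Value: value})
}

// ReadAt returns the value with the highest timestamp that is no
// greater than the reading transaction's timestamp ts.
func (s *Store) ReadAt(key string, ts int64) (string, bool) {
	var best *Version
	for i := range s.versions[key] {
		v := &s.versions[key][i]
		if v.Timestamp <= ts && (best == nil || v.Timestamp > best.Timestamp) {
			best = v
		}
	}
	if best == nil {
		return "", false
	}
	return best.Value, true
}
```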
This snapshot isolation idea is what Spanner uses for read-only transactions. Read-write transactions still use two-phase locking and two-phase commit; they allocate themselves a timestamp at commit time, but other than that they work in the usual way with locks and two-phase commit. Read-only transactions, by contrast, access the multiple versions in the database and get the version written by the transaction with the highest timestamp that's still below their own. Where this gets us is that read-only transactions will see all the writes of read-write transactions with lower timestamps, and none of the writes of read-write transactions with higher timestamps.

So how does snapshot isolation work out for the example we had before, where serializability failed because the reading transaction read values that did not fit between any two read-write transactions? Here's the same example, but with snapshot isolation; I'm showing you this to demonstrate that the snapshot isolation technique solves our problem and makes the read-only transaction serializable. Again we have the two read-write transactions T1 and T2, and our read-only transaction T3. T1 and T2 write and commit as before, but now they allocate themselves timestamps as of their commit times: in addition to using two-phase commit and two-phase locking, each read-write transaction picks a timestamp. Let's imagine that at the time of its commit, T1 looked at the clock and saw that the time was 10; I'm going to use times like 10 and 20, but you should imagine them as real times, like four o'clock in the morning on a given day. So say T1 sees the time as 10 when it commits, and T2 sees the time as 20 when it commits; I'll write each transaction's chosen timestamp after an @ sign. When transaction T1 does its writes, the Spanner storage system doesn't overwrite the current values; it just adds a new copy of each record tagged with the timestamp. So the database stores a new record saying the value of x at time 10 is whatever it happens to be, let's say 9, and the value of record y at time 10 is, say, 11; maybe we're doing a transfer from x to y. Similarly, T2 chose a timestamp of 20, because that was the real time when it committed.
The database is going to remember a new set of records in addition to the old ones: x at time 20, where maybe we did another transfer from x to y, and y at time 20 equals 12. So now we have two copies of each record, at different times. Now transaction T3 comes along: again it starts at about the same point as before and does its read of x, and again it's slow, so it doesn't get around to reading y until much later in real time. However, when T3 started, it chose a timestamp by looking at the current time, and since we know that in real time T3 started after T1 and before T2, it must have chosen a timestamp somewhere between 10 and 20; let's suppose it started at time 15 and chose timestamp 15 for itself.

That means that when it does the read of x, it sends a request to the local replica that holds x and accompanies it with its timestamp of 15: please give me the latest data as of time 15. Of course transaction T2 hasn't executed yet, but in any case the highest-timestamped copy of x below 15 is the one from time 10 written by T1, so this read returns 9. Time passes, transaction T2 commits, and now T3 does its second read, again accompanying the read request with its own timestamp of 15. The server now has two records for y, but because it gets T3's timestamp of 15, it looks at its records and says: aha, 15 sits between these two, so I'm going to return the record for y with the highest timestamp that's still below the requested timestamp, and that's still the version of y from time 10. So the read of y returns 11. That is, the read of y physically happens late, but because we remembered a timestamp, and the database keeps data as of the different times it was written, it's as if both reads happened at time 15, instead of one at time 15 and one later. And now you can see that this essentially emulates a serial, one-at-a-time execution in which the order is timestamp order: transaction T1, then transaction T3, then transaction T2. The serial order whose results were actually produced is the timestamp order 10, 15, 20.

All right, so that's a simplified version of what Spanner does for read-only transactions; there's more complexity, which I'll get to in a minute.
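If you plug the numbers from this example into the multi-version store sketched earlier (again, Store, Put, and ReadAt are invented names, and the x-at-time-20 value of 8 is made up since the lecture doesn't give one), you get the same answers the walkthrough describes:

```go
package mvcc

import "fmt"

// demo replays the timestamps from the example above: T1 commits at 10,
// T2 at 20, and the read-only T3 reads at timestamp 15, so both of T3's
// reads return T1's versions and neither sees T2's.
func demo() {
	s := NewStore()
	s.Put("x", 10, "9")  // written by T1 @ 10
	s.Put("y", 10, "11") // written by T1 @ 10
	s.Put("x", 20, "8")  // written by T2 @ 20 (value made up for the sketch)
	s.Put("y", 20, "12") // written by T2 @ 20

	x, _ := s.ReadAt("x", 15) // T3 @ 15 -> "9"
	y, _ := s.ReadAt("y", 15) // T3 @ 15 -> "11"
	fmt.Println(x, y)
}
```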
One question you might have is why it was okay for transaction T3 to read an old value of y. It issued its read of y at a point in time when the freshest data for y was the value 12, but the value it actually got was intentionally a stale value, not the freshest one: the value 11 from a while ago. Why is it okay not to use the freshest version of the data? The technical justification is that transaction T2 and transaction T3 are concurrent, that is, they overlap in time. The rules for linearizability and external consistency say that if two transactions are concurrent, then the serial order the database is allowed to use can put the two transactions in either order, and here Spanner has chosen to put T3 before T2 in the serial order.

Robert, we have a student question: does external consistency with timestamps always imply strong consistency? Yes, I think so. What people usually mean by strong consistency is linearizability, and I believe the definitions of linearizability and external consistency are the same, so I would say yes. Another question: how does this not absolutely blow up storage? That is a great question, and the answer is that it definitely costs storage. The storage system now has to keep multiple copies of records that have been recently modified multiple times, and that's definitely an expense, both in space on disk and in memory, plus an added layer of bookkeeping: lookups now have to consider timestamps as well as keys. The storage expense, I think, is not as great as it could be, because the system discards old records. The paper doesn't say what the policy is, but presumably it must be discarding old versions. Certainly, if the only reason for the multiple versions is to implement snapshot isolation for these kinds of transactions, then you don't need to remember values too far in the past: you only need to keep values back to the earliest time at which a transaction that's still running now could have started. If your transactions always finish, or are forced to finish by being killed, within, say, one minute, then you only have to remember the last minute of versions. In fact, the paper implies they keep data farther back than that, because they intentionally support snapshot reads, which allow seeing data from a while ago, from yesterday or something; but they don't say what the garbage collection policy for old values is, so I don't know how expensive it is for them.

Okay, so again, the justification for why this is legal: the only rule that external consistency imposes is that if one transaction has completed, then a transaction that starts after it must see its writes. T1 completed, let's say, at a certain time, and T3 started just after it, so external consistency demands that T3 see T1's writes. But since T2 definitely didn't finish before T3 started, we have no obligation under external consistency for T3 to see T2's writes, and indeed in this example it does not, so it's actually legal.

Another problem that comes up is that transaction T3 needs to read data as of a particular timestamp.
The reason reading at a timestamp is desirable is that it allows us to read from a local replica in the same data center, but maybe that local replica is in the minority of Paxos followers that didn't see the latest log records from the leader. Maybe our local replica has never even seen these writes to x and y at all; it's still back at a version from time five or six or seven. If we don't do something clever, then when we ask it for the highest version below timestamp 15, we may get some much older version that's not actually the one produced by transaction T1, which we're required to see.

The way Spanner deals with this is with a notion of safe time. Each replica is receiving log records from its Paxos leader, and the paper arranges things so that the leader sends out log records in strictly increasing timestamp order. So a replica can look at the very last log record it has received from its leader to know how up to date it is. If I ask a replica for a value as of timestamp 15, but the replica has only received log entries from its Paxos leader up through timestamp 13, the replica is going to make us wait: it won't answer until it has received a log record with timestamp at least 15 from the leader. This ensures that replicas don't answer a request for a given timestamp until they're guaranteed to know everything from the leader up through that timestamp. So this may delay reads.
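Here is a minimal sketch, in Go, of the safe-time check a replica might make before serving a snapshot read. The names (Replica, Advance, WaitSafe) are invented, and a real server would block on a condition variable rather than polling; this only shows the rule that a read at timestamp ts must wait until the replica has seen the leader's log up through ts.

```go
package safetime

import (
	"sync"
	"time"
)

// Replica tracks the timestamp of the latest log record received from
// its Paxos leader. Because the leader sends records in increasing
// timestamp order, everything at or below lastApplied is known here.
type Replica struct {
	mu          sync.Mutex
	lastApplied int64
}

// Advance is called as log records arrive from the leader.
func (r *Replica) Advance(ts int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ts > r.lastApplied {
		r.lastApplied = ts
	}
}

// WaitSafe delays a snapshot read at timestamp ts until the replica has
// seen log records up through ts, so it cannot return stale data.
func (r *Replica) WaitSafe(ts int64) {
	for {
		r.mu.Lock()
		ok := r.lastApplied >= ts
		r.mu.Unlock()
		if ok {
			return
		}
		time.Sleep(time.Millisecond) // polling keeps the sketch short
	}
}
```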
So, the next question. I've been assuming in this discussion that the clocks on all the different servers are perfectly synchronized, so that everybody's clock says, say, 10:01 and 30 seconds at exactly the same moment. It turns out that you can't synchronize clocks that precisely; it's basically impossible to get perfectly synchronized clocks, and the reasons are reasonably fundamental. So the topic is time synchronization, which is about making sure different clocks read the same real-time value. The fundamental problem is that time is defined as, basically, whatever a collection of highly accurate, expensive clocks in a set of government laboratories says. We can't read those clocks directly; what happens is that the government laboratories broadcast the time in various ways, the broadcasts take time, and so it's some possibly unknown time later that we hear these announcements of what the time is, and we may all hear them at different moments due to varying delays.

But first I want to consider what the impact on snapshot isolation is if the clocks are not synchronized, which they won't be. There's actually no problem at all for Spanner's read-write transactions, because the read-write transactions use locks and two-phase commit; they're not relying on snapshot isolation, so they don't care: read-write transactions will still be serialized by the two-phase locking mechanism. So we're only interested in what happens to a read-only transaction. Suppose a read-only transaction chooses a timestamp that is too large, that is, in the future: it's now 12:01 p.m. and it chooses a timestamp of, say, 1:00 p.m. That's actually not that bad. What it means is that when it sends a read request to some replica, the replica will say: wait a minute, your timestamp is far greater than the last log entry I've seen from my Paxos leader, so I'm going to make you wait until the log entries from the Paxos leader catch up to the time you've requested, and only then respond. So this is correct, but slow; the reader will be forced to wait. That's not the worst thing in the world. But what happens if a read-only transaction's timestamp is too small? This would correspond to its clock either being set wrong, so that it reads a time in the past, or having been set correctly originally but ticking too slowly. This obviously causes a correctness problem: it's a violation of external consistency, because you'll hand the multi-version database a timestamp that's far in the past, say an hour ago, and the database will give you the value associated with that old timestamp, which may ignore more recent writes. So assigning a transaction a timestamp that's too small will cause it to miss recent committed writes, and that's a violation of external consistency. So we actually have a problem here: the assumption that the clocks were synchronized is a very serious assumption, and the fact that you cannot count on it means that unless we do something, the system is going to be incorrect.

All right, so can we synchronize clocks perfectly? That would be the ideal thing, and if not, why not? As I mentioned, time comes from what is essentially the median of a collection of clocks in government labs. The way we hear about the time is that it's broadcast by various protocols, sometimes radio protocols. Basically, what GPS is doing for Spanner is acting as a radio broadcast system that broadcasts the current time from a government lab, through the GPS satellites, to GPS receivers sitting in Google's machine rooms. There are a number of other radio protocols, like WWVB, an older radio protocol for broadcasting the current time, and there are newer protocols like NTP, which operates over the Internet and is also in the business of broadcasting time. So the system diagram is that there are some government labs, and the government labs with their accurate clocks define a universal notion of time called UTC. So we have UTC coming from some clocks in some labs.
Then we have some radio or Internet broadcast; in the case of Spanner we can think of the government labs as broadcasting to the GPS satellites, and the satellites in turn broadcast to the millions of GPS receivers that are out there. You can buy a GPS receiver for a couple hundred bucks that will decode the timestamps in the GPS signals and keep you up to date with exactly what the time is, corrected for the propagation delay between the government labs and the GPS satellites, and also corrected for the delay between the GPS satellites and your current position. Then, in each data center, there's a GPS receiver connected up to what the paper calls a time master, which is some server; there will be more than one of these per data center in case one fails. And then there are the hundreds of servers in the data center running Spanner, either as servers or as clients; each of them periodically sends a request saying "what time is it?" to one, or usually more than one, of the local time masters, and the time master replies with: oh, I think the current time as received from GPS is such-and-such.

Built into this, unfortunately, is a certain amount of uncertainty, and there are a few primary sources of it. There's fundamental uncertainty in that we don't know exactly how far we are from the GPS satellites: radio signals take some amount of time to travel, so even though the GPS satellite knows exactly what time it is, those signals take some time to reach our GPS receiver, and we're not sure exactly how long. That means that when we get a radio message from a GPS satellite saying "exactly 12 o'clock", the uncertainty in the propagation delay means we're not really sure whether it's 12 o'clock, or a little before, or a little after. In addition, every time the time is communicated there's added uncertainty you have to account for. The biggest source is that when a server sends a request to the time master, it only gets a response after a while. If the response says "it's exactly 12 o'clock", but, say, a second passed between when the server sent the request and when it got the response, then all the server knows, even if the master had the correct time, is that the time is within a second of 12 o'clock: maybe the request was instant and the reply was delayed by a second, or maybe the request was delayed by a second and the reply was instant. So all you really know is that it's somewhere between 12:00:00 and 12:00:01. So there's always this uncertainty, and we really can't ignore it.
The uncertainties we're talking about here are milliseconds, and we're going to find out that the uncertainty in the time goes directly into how long these safe waits have to be, and how long some other pauses have to be, the commit wait, as we'll see. So uncertainty at the level of milliseconds is a serious problem. The other big source of uncertainty is that each of these servers only requests the current time from a time master every once in a while, say every minute, and in between, each server runs its own local clock that keeps time starting from the last answer from the master. Those local clocks are actually pretty bad and can drift by milliseconds between times that the server talks to the master, so the system has to add the unknown but estimated drift of the local clock to the uncertainty of the time.

In order to capture this uncertainty and account for it, Spanner uses the TrueTime scheme, in which, when you ask what time it is, what you actually get back is one of these TTinterval things, which is a pair of an earliest time and a latest time. Earliest is the earliest the true time could possibly be, and latest is the latest it could possibly be. So when the application makes the library call that asks for the time, it gets back this pair, and all it knows is that the current time is somewhere between earliest and latest. Earliest might be 12 o'clock and latest might be 12:00:01; the only guarantee is that the correct time isn't less than earliest and isn't greater than latest, and we don't know where in between it is.
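Here is a minimal sketch, in Go, of what the TrueTime interface looks like from a caller's point of view. The TTInterval pair mirrors what the lecture describes; the way Now() computes its bounds, from the uncertainty at the last time-master sync plus an assumed local drift rate, is my own guess at the flavor of the calculation, not Spanner's actual formula.

```go
package truetime

import "time"

// TTInterval is what a TrueTime query returns: the true wall-clock time
// is guaranteed to lie somewhere between Earliest and Latest.
type TTInterval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock is a guess at the state a server might keep between syncs with
// a time master; the fields and the drift model are illustrative only.
type Clock struct {
	lastSync       time.Time     // when we last heard from a time master
	syncUncert     time.Duration // uncertainty at that moment (e.g. half the request RTT)
	driftPerSecond time.Duration // assumed worst-case drift of the local clock
}

// Now returns an interval whose width grows the longer it has been since
// the last time-master sync, to cover possible local clock drift.
func (c *Clock) Now() TTInterval {
	local := time.Now()
	elapsed := local.Sub(c.lastSync).Seconds()
	eps := c.syncUncert + time.Duration(float64(c.driftPerSecond)*elapsed)
	return TTInterval{Earliest: local.Add(-eps), Latest: local.Add(eps)}
}
```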
67:16 Okay, so the start rule says this is how Spanner chooses timestamps. The commit wait rule, which applies only to read/write transactions, says that when a transaction coordinator has collected the votes, sees that it's able to commit, and chooses a timestamp, then after it chooses that timestamp it's required to delay, to wait a certain amount of time, before it's allowed to actually commit, write the values, and release the locks. A read/write transaction has to delay until the timestamp it chose, when it was starting to commit, is less than the current time's earliest value.

68:13 So what's going on here is that the coordinator sits in a loop calling TT.now(), and it stays in that loop until the timestamp it chose at the beginning of the commit process is less than the earliest half of the current time. What this guarantees is that, since the earliest possible correct time is now greater than the transaction's timestamp, by the time this loop finishes, by the time the commit wait is finished, the transaction's timestamp is absolutely guaranteed to be in the past.
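The commit wait itself is just a loop around the TrueTime call. Here is a minimal sketch continuing the hypothetical code above (same package); commitWait is my name for it, and a real system would block more cleverly than polling.

```go
// commitWait blocks until ts is guaranteed to be in the past, that is,
// until even the earliest possible true time is later than ts. Only after
// this returns may a read/write transaction apply its writes and release
// its locks.
func commitWait(ts time.Time) {
	for !TTNow().Earliest.After(ts) {
		time.Sleep(time.Millisecond) // poll again; a real system would wait more cleverly
	}
	// ts is now definitely in the past: safe to commit and release locks.
}
```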
68:49 Okay, so how does the system actually make use of these two rules in order to enforce external consistency for read-only transactions? I want to cook up a somewhat simplified scenario to illustrate this. I'm going to imagine that the writing transactions only do one write each, just to reduce the complexity. Let's say there are two read/write transactions, T0 and T1, which both write x, and a T2 which is going to read x. T2 is going to use snapshot isolation on timestamps, and we want to make sure it sees the latest written value.

69:48 So we're going to imagine that T0 writes 1 to x and then commits, and then T1 also writes x, writing the value 2. We need to distinguish between prepare and commit: it's really at prepare time that a transaction chooses its timestamp, so there's a point at which it chooses a timestamp, and it commits some time later. And we're assuming that T2 starts after T1 finishes, so it's going to read x afterwards, and we want to make sure it sees 2.

70:34 All right, so let's suppose that T0 chooses a timestamp of 1, commits, and writes the database. Let's say T1 then starts, and at the time it chooses its timestamp it doesn't get a single number from the TrueTime system; it really gets a range, an earliest and a latest value. Let's say at the time it chooses its timestamp, the earliest value it gets is 1 and the latest is 10. The start rule says it must choose 10, the latest value, as its timestamp, so T1 is going to commit with timestamp 10.

71:24 Now it can't commit yet, because the commit wait rule says it has to wait until its timestamp is guaranteed to be in the past. So transaction T1 is going to sit there and keep asking what time it is, until it gets an interval back that doesn't include time 10. At some point it's going to ask what time it is and get back an interval whose earliest value is 11 and whose latest is, let's say, 20, and now it can say: aha, my timestamp is guaranteed to be in the past, and I can commit. So T1 actually sits there and waits for a while, this is its commit wait period, before it commits.

72:07 Okay, now after T1 commits, transaction T2 comes along and wants to read x. It's going to choose a timestamp too. We're assuming that it starts after T1 finishes, because that's the interesting scenario for external consistency. So when it asks for the time, it asks at a time after time 11, and it's going to get back an interval that includes time 11. Let's suppose it gets back an interval that goes from time 10, the earliest, to time 12, the latest. And of course the latest, 12, must be at least time 11, because transaction T2 started after transaction T1 finished; that means 11 must be less than the latest value. Transaction T2 is going to choose this latest half as its timestamp, so it actually chooses timestamp 12.

73:09 In this example, when it does its read, it's going to ask the storage system to read as of timestamp 12. Since transaction T1 wrote with timestamp 10, that means that, assuming the safe time machinery works, we're actually going to read the correct value.

73:33 So this happened to work out, but indeed it's guaranteed to work out as long as transaction T2 starts after transaction T1 commits, and the reason is that commit wait causes transaction T1 not to finish committing until its timestamp is guaranteed to be in the past. Transaction T1 chooses a timestamp and is guaranteed to commit after that timestamp. Transaction T2 starts after that commit. We don't know anything about what its earliest value will be, but its latest value is guaranteed to be after the current time, and we know that the current time is after the commit time of T1. Therefore T2's latest value, the timestamp it chooses, is guaranteed to be after when T1 committed, and therefore after the timestamp that T1 used. So if transaction T2 starts after T1 finishes, transaction T2 is guaranteed to get a higher timestamp, and the snapshot isolation machinery, the multiple versions, will cause its read to see the writes from all lower-timestamped transactions. That means T2 is going to see T1's write, and that is basically how Spanner enforces external consistency for its transactions.

75:04 Any questions about this machinery? All right.
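To make the multi-version side of this concrete, here is a toy Go sketch of reading at a timestamp, replaying the scenario above. This is my own illustration, not Spanner's code: each key keeps timestamped versions, and a read at timestamp t returns the newest version whose timestamp is at most t, so T2's read at 12 sees T1's write at 10.

```go
package main

import "fmt"

// version is one timestamped value of a key in a toy multi-version store.
type version struct {
	ts    int // commit timestamp of the writing transaction
	value int
}

// store maps each key to its versions, kept in increasing timestamp order.
type store map[string][]version

// write records a new version of key committed at timestamp ts.
func (s store) write(key string, ts, value int) {
	s[key] = append(s[key], version{ts, value})
}

// readAt returns the newest version of key with timestamp <= ts,
// which is what a snapshot read at timestamp ts should see.
func (s store) readAt(key string, ts int) (int, bool) {
	val, found := 0, false
	for _, v := range s[key] {
		if v.ts <= ts {
			val, found = v.value, true
		}
	}
	return val, found
}

func main() {
	db := store{}
	db.write("x", 1, 1)  // T0 commits x=1 at timestamp 1
	db.write("x", 10, 2) // T1 commits x=2 at timestamp 10, after its commit wait

	// T2 starts after T1 finishes, so its chosen timestamp (12 here)
	// must exceed T1's; reading at 12 therefore sees T1's write.
	if v, ok := db.readAt("x", 12); ok {
		fmt.Println("T2 reads x =", v) // prints: T2 reads x = 2
	}
}
```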
75:18 I'm going to step back a little bit. There are really, from my point of view, two big things going on here. One is snapshot isolation by itself: keeping the multiple versions and giving every transaction a timestamp. Snapshot isolation is enough to give you serializable read-only transactions, because basically what snapshot isolation means is that we use these timestamps as the equivalent serial order, and things like the safe time ensure that read-only transactions really do read as of their timestamps: they see every read/write transaction before that timestamp and none after it.

76:04 So there are really two pieces. Snapshot isolation by itself, which is actually often used, not just by Spanner, doesn't by itself guarantee external consistency, because in a distributed system it's different computers choosing the timestamps, so we can't be sure those timestamps will obey external consistency even if they deliver serializability. So in addition to snapshot isolation, Spanner also has synchronized timestamps, and it's the synchronized timestamps plus the commit wait rule that allow Spanner to guarantee external consistency as well as serializability.

76:43 And again, the reason all this is interesting is that programmers really like transactions, and they really like external consistency, because those make applications much easier to write, but they traditionally have not been provided in distributed settings because they're too slow. So the fact that Spanner manages to make read-only transactions very fast is extremely attractive: no locking, no two-phase commit, and not even any distant reads for read-only transactions; they operate very efficiently from the local replicas. That is what's good for basically a factor-of-ten latency improvement, as measured in Tables 3 and 6.

77:29 But just to remind you, it's not all fabulous. All this wonderful machinery really only applies to read-only transactions. Read/write transactions still use two-phase commit and locks, and there are a number of cases in which even Spanner will have to block, for example due to the safe time and the commit wait. But as long as the clocks are accurate enough, these commit waits are likely to be relatively small.

77:59 Okay, just to summarize: Spanner at the time was kind of a breakthrough, because it was very rare to see deployed systems that run distributed transactions over data spread across geographically distant data centers. People were surprised that somebody had a database that actually did a good job of this and that the performance was tolerable. The snapshot isolation and the TrueTime timestamps are probably the most interesting aspects of the paper. And that is all I have to say for today. Any last questions? Okay.

78:49 I think on Thursday we're going to see FaRM, which is a very different slice through the desire to provide very high performance transactions. So I'll see you on Thursday.