Today the paper I'm going to discuss is Frangipani. This is a fairly old distributed file system paper; the reason we're reading it is that it has a lot of interesting and good design having to do with cache coherence, distributed transactions, and distributed crash recovery, as well as the interactions between them. Those are really the ideas we're going to try to tease out.

Cache coherence is the idea that if I have something cached, and you modify it, something will nevertheless happen so that, despite my cache, I can see your modifications. We also have distributed transactions, which file systems need internally to be able to make complex updates to the file system data structures. And because the file system is essentially split up among a bunch of servers, it's critical to be able to recover from crashes of those servers.

The overall design: Frangipani is a network file system. It's intended to work with existing applications, ordinary UNIX programs running on people's workstations, much like Athena's AFS lets you get at your Athena home directory and various project directories from any Athena workstation. So the overall picture is that you have a bunch of users, and each user, in the paper's world, is sitting in front of a workstation, which in those days was not a laptop but a real computer with a keyboard, display, mouse, and window system. I'll call the workstations workstation 1, workstation 2, and so on. Each workstation runs an instance of the Frangipani server, and almost all of the interesting stuff in this paper goes on in the Frangipani software in each workstation. A user might be running ordinary programs like a text editor that reads and writes files, and maybe when they finish editing a source file they run it through the compiler, which reads that source file. When these ordinary programs make file system calls, there's a Frangipani module inside the kernel of each of these workstations that implements the file system.

The real storage of the file system data structures, certainly file contents, but also inodes, directories, the list of files in each directory, and the information about which inodes and which blocks are free, is all stored in a shared virtual disk service called Petal. Petal runs on a separate set of machines, probably server machines in a machine room rather than workstations on people's desks. Among many other things, Petal replicates data, so you can think of Petal servers as coming in pairs: if one crashes, we can still get at our data.

When Frangipani needs to read a directory or something, it sends a remote procedure call to the correct Petal server saying, here's the block that I need, please read it and return it. For the most part Petal acts like a disk drive; you can think of it as a shared disk drive that all these Frangipanis talk to, which is why it's called a virtual disk. From our point of view, for most of this discussion, we're just going to imagine Petal as a disk drive used over the network by all these Frangipanis: you read and write it by giving it a block number, an address on the disk, and saying I'd like to read that block, just like an ordinary hard drive.
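Since everything that follows goes through Petal, it may help to pin down what this block interface amounts to. Here's a minimal Go sketch with invented names; the paper treats Petal simply as a replicated virtual disk addressed by block number, so this is only an illustration of that idea, not Petal's actual API.

```go
// VirtualDisk is roughly the view a Frangipani workstation has of Petal:
// a big array of blocks reached by RPC, with replication and the storage
// system's own fault tolerance hidden behind the interface.
type VirtualDisk interface {
	// Read returns the contents of the block at the given block number.
	Read(blockNum uint64) ([]byte, error)
	// Write replaces the contents of the block at the given block number.
	Write(blockNum uint64, data []byte) error
}
```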
Okay, so the intended use for this file system, the use the authors intended, is actually a reasonably important driver of the design. What they wanted was to support their own activities: they were members of a research lab of maybe 50 people, and they were used to shared infrastructure, things like time-sharing machines, or workstations using previous network file systems to share files among cooperating groups of researchers. They wanted a file system they could use to store their own home directories as well as shared project files. That meant that if I edit a file, I'd really like the other people I work with to be able to read the file I just edited; we want that kind of sharing. In addition, it's great if I can sit down at any workstation, my workstation, your workstation, a public workstation in the library, and still get at all the files in my home directory, everything I need in my environment. So they were really interested in a shared file system for human users in a relatively small organization, small enough that everybody was trusted, all the people and all the computers. The design has essentially nothing to say about security, and indeed it arguably would not work in an environment like Athena, where you can't really trust the users or the workstations. It's very much designed for their environment.

As far as performance, their environment was also important. It turns out that the way most people use computers, at least the workstations they sit in front of, is that they mostly read and write their own files. They may read some shared files, programs or project files or something, but most of the time I'm reading and writing my files, and you're reading and writing your files on your workstation; it's really the exception that we're actively sharing files. So it makes a huge amount of sense, even though officially the real copies of files are stored in this shared disk, to have some kind of caching, so that after I log in and use my files for a while, they're cached locally and can be gotten at in microseconds instead of the milliseconds it would take to fetch them from the file servers.
Okay, so Frangipani supported this kind of caching, and furthermore it supported write-back caching in each workstation's Frangipani server. Write-back caching means that if I modify a file, or even create a file in a directory, or delete a file, or do basically any other operation, then as long as no other workstation needs to see it, my writes stay purely local in the cache. If I create a file, then at least initially all the information about the newly created file, a newly allocated inode with initialized contents, a new name added to my home directory, all those modifications are done just in the cache. Therefore things like creating a file can be done extremely rapidly: they just require modifying local memory in this machine's cache, and in general they're not written back to Petal until later. So at least initially we can do all kinds of modifications to the file system, at least to my own directories and my own files, completely locally, and that's enormously helpful for performance; it's a factor of maybe a thousand between modifying something in local memory and having to send a remote procedure call to a server.

One serious consequence of that, and it's extremely determinative of the architecture here, is that the logic of the file system has to be in each workstation. In order for my workstation to implement things like creating a file purely out of its local cache, all the logic, all the intelligence for the file system, has to be sitting in my workstation. In their design, to a first approximation, the Petal shared storage system knows absolutely nothing about file systems or files or directories; Petal is in a sense a very straightforward, simple system, and all the complexity is in the Frangipani code in each client. So it's a very decentralized scheme, and one of the reasons is that this was the design they could think of that allowed them to do modifications purely locally in each workstation.

It does have a nice side effect, though: since most of the complexity, and most of the CPU time spent, is in the workstations, as you add workstations and users to the system, you automatically get more CPU capacity to run those new users' file system operations, because most file system operations happen locally in the workstation. So the system has a certain degree of natural scalability: each new workstation is a bit more load from a new user, but it's also a bit more available CPU time to run that user's file system operations. Of course, at some point you're going to run out of gas in the central storage system, and then you may need to add more storage servers.
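As a rough sketch of what write-back caching implies for the data structures (the names here are hypothetical, not from the paper): a file create only has to touch this kind of in-memory state, which is why it's so fast.

```go
// cacheEntry is one locally cached copy of a file-system block; dirty
// means it has been modified locally and not yet written back to Petal.
type cacheEntry struct {
	data  []byte
	dirty bool
}

// blockCache maps Petal block numbers to locally cached copies.
type blockCache map[uint64]*cacheEntry

// modifyLocally records a purely local modification: only RAM is
// touched. The dirty blocks reach Petal later, when another workstation
// needs to see them or on a periodic flush.
func (c blockCache) modifyLocally(blockNum uint64, data []byte) {
	c[blockNum] = &cacheEntry{data: data, dirty: true}
}
```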
Okay. So we have a system that does serious caching, and furthermore does its modifications in the cache, and that leads immediately to some serious challenges in the design; the design is mostly about solving the challenges I'm about to lay out. These are largely challenges that come from caching and from this decentralized architecture where most of the intelligence sits in the clients.

The first challenge: suppose workstation 1 creates a file, say a new file /a, and initially it just creates this in its local cache. First it may need to fetch the current contents of the / directory from Petal, but then when it creates the file, it just modifies its cached copy and doesn't immediately send it back to Petal. There's an immediate problem here: suppose the user on workstation 2 tries to get a directory listing of /. We'd really like that user to see the newly created file; that's what users expect, and users will be very confused if the person down the hall from me created a file and said, oh, I put all this interesting information in this new file /a, why don't you go read it, and then I try to read it and it's totally not there. So we absolutely want very strong consistency: if the person down the hall says they've done something in the file system, I should be able to see it; and if I edit a file on one workstation and then compile it on another computer, I want the compiler to see the modifications I just made to my file. That means the file system has to do something to ensure that readers see even the most recent writes. We've been calling this strong consistency or linearizability before, and that's basically what we want; in the context of caches, though, the issue isn't really about the storage server, it's about the fact that there was a modification here that needs to be seen somewhere else, and for historical reasons that's usually called cache coherence. Cache coherence is the property of a caching system that even if I have an old version of something cached, if someone else modifies it in their cache, then my cache will automatically reflect their modifications. So we want this cache coherence property.

Another issue is that, since all the files and directories are shared, we could easily have a situation where two different workstations are modifying the same directory at the same time. Suppose user 1 on their workstation wants to create a file /a, a new file in the root directory, and at the same time user 2 wants to create a new file called /b. At some level they're creating different files, a and b, but they both need to modify the root directory to add a new name to it. So the question is, even if they do this simultaneously, two creations of differently named files but in the same directory from different workstations, will the system be able to sort out these concurrent modifications to the same directory and arrive at some sensible result?
Of course, the sensible result we want is that both a and b end up existing; we don't want to end up in some situation where only one of them exists because the second modification overwrote and superseded the first. This goes by a lot of different names, but we'll call it atomicity: we want operations such as creating a file or deleting a file to act as if they're instantaneous in time, and therefore never interfere with operations occurring at similar times on other workstations. We want things to happen at a point in time and not be spread out, so that even complex operations that touch a lot of state appear to occur instantaneously.

A final problem: suppose my workstation has modified a lot of stuff, and many of its modifications are done only in the local cache because of this write-back caching. If my workstation crashes after having modified some stuff in its local cache, having reflected some but not all of those modifications back to Petal, the other workstations are still executing, and they still need to be able to make sense of the file system. The fact that my workstation crashed while in the middle of something had better not wreck the entire file system for everybody else, or even any part of it. So what we need is crash recovery of individual servers: we want to be able to have my workstation crash without disturbing the activity of anybody else using the same shared system. Even if they look at my directories and my files, they should see something sensible; maybe it won't include the very last things I did, but they should see a consistent file system, not a wrecked file system data structure. As always with distributed systems, this is made more complex because we can easily have a situation where only one of the servers crashes while the others keep running.

For all three of these challenges, in this discussion they're really challenges about how Frangipani works, how the Frangipani software inside the workstations works. So when I talk about a crash, I'm talking about a crash of a workstation and its Frangipani. The Petal virtual disk has many similar questions associated with it, but those aren't really the focus today; Petal has a completely separate set of fault tolerance machinery built into it, actually a lot like the chain replication kind of systems we talked about earlier.

Okay, so I'm going to talk about each of these challenges in turn. The first challenge is cache coherence, and the game here is to get the benefits of both linearizability, meaning that when I look at anything in the file system I always see fresh data, the very latest data, and caching, as much caching as we can get, for performance.
Somehow we need to get the benefits of both of these. The way people implement cache coherence is with what are called cache coherence protocols, and it turns out these protocols are used in many different situations, not just distributed file systems: the per-core caches in multi-core processors also use cache coherence protocols, which are not unlike the protocols I'm going to describe for Frangipani.

All right. It turns out that Frangipani's cache coherence is driven by its use of locks, and we'll see locks come up later for both atomicity and crash recovery, but the particular use of locks I'm going to talk about for now is the use of locks to drive cache coherence, to help workstations ensure that even though they're caching data, they're caching the latest data. So as well as the Frangipani workstations and the Petal servers, there's a third kind of server in the Frangipani system: lock servers. I'm just going to pretend there's one lock server, although you could shard the locks over multiple servers. The lock server is, logically at least, a separate computer, although I think they ran them on the same hardware as the Petal servers. It basically just has a table of named locks. We'll consider the locks to be named after file names, although in fact they're named after inumbers. So for every file we potentially have a lock, and each lock is possibly owned by some owner. For this discussion I'm going to describe the locks as if they were exclusive locks, although in fact Frangipani has a more complicated scheme that allows either one writer or multiple readers. So, for example, maybe file x has recently been used by workstation 1, and workstation 1 has a lock on it; maybe file y was recently used by workstation 2, and workstation 2 has a lock on it. The lock server remembers, for each file, who has the lock, if anyone does.

Each workstation also keeps track of which locks it holds, and this is tightly tied to its tracking of cached data. In each workstation's Frangipani module there's also a lock table, recording which file the workstation holds a lock for, what kind of lock it is, and the cached contents of that file, which might be a whole bunch of data blocks, or directory contents, for example. So there's a lot of content here. When a Frangipani server decides it needs to use the directory /, or look at the file a, or look at an inode, it first asks the lock server for a lock on whatever it's about to use, and only then does it ask Petal for the data for that file or directory, whatever it needs to read.
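Here's a guess, again with invented names, at the shape of the two tables just described: the lock server's view, and a workstation's view with the cached data hanging off its lock table.

```go
// The lock server's table: for each lock (named by an inumber in real
// Frangipani, a plain uint64 here), which workstation holds it, if any.
type lockServer struct {
	owner map[uint64]string // lock ID -> holding workstation
}

// A workstation's lock table ties each held lock to the cached contents
// of the file or directory that lock protects. The busy/idle distinction
// is described next.
type lockState int

const (
	busy lockState = iota // some system call is actively using the data
	idle                  // still held here, but no operation is using it
)

type wsLockEntry struct {
	state  lockState
	dirty  bool   // cached contents modified, not yet written to Petal
	cached []byte // cached contents of the file or directory
}

type workstation struct {
	locks map[uint64]*wsLockEntry
}
```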
Then the workstation remembers: aha, I have a copy of file x, and its content, whatever the content of file x is, is cached. It turns out that workstations can hold a lock in at least two different modes. The workstation can be actively reading or writing the file or directory right now, that is, it's in the middle of a file creation operation or a deletion or a rename or something; in that case I'll say the lock is held by the workstation and is busy. Alternatively, after a workstation has done some operation like creating a file or reading a file, it releases the lock internally as soon as it's done with that system call, rename or read or write or create, whatever it was. It's not actively using that file anymore, but as far as the lock server is concerned, the workstation still holds the lock; the workstation just notes, for its own use, that it's not actively using that lock anymore. We'll say the lock is still held by the workstation but is idle, and that'll be important in a moment.

Okay, so I think these two tables are set up consistently: if this is workstation 1, the lock server knows locks for x and y exist and are both held by workstation 1; workstation 1 has the equivalent information in its table, knows it's holding these two locks, and furthermore remembers the cached content of the files or directories the two locks cover.

There are a number of rules that Frangipani follows that cause it to use the locks in a way that provides cache coherence, that ensures nobody ever uses stale data from their cache. These are rules used in conjunction with the locks and the cached data. The overriding invariant is that no workstation is allowed to hold any cached data unless it also holds the lock associated with that data: no cached data without the lock that protects that data. Operationally, what this means is that before a workstation uses data, it first acquires the lock on the data from the lock server, and only after the workstation has the lock does it read the data from Petal and put it in its cache. So the sequence is: acquire the lock, then read from Petal. You can't get the lock after the fact; if you want to cache the data, you first have to get the lock, and only strictly afterwards read from Petal. And if you ever release a lock, the rule is that before releasing it, if you modified the locked data in your cache, you first have to write the modified data back to Petal, and only when Petal says, yes, I got the data, are you allowed to release the lock, that is, give the lock back to the lock server. So the sequence is always: first write the cached data to the Petal storage system, and then release the lock, erasing the entry and the cached data from the workstation's lock table.
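Continuing the sketch, here are the two rules as code; the helper functions stand in for the RPCs described in the text, and the sketch pretends one lock covers one Petal block (real Frangipani maps an inumber to the relevant blocks).

```go
// Stubs standing in for the network operations in the text.
func requestLock(id uint64) error          { return nil } // request, wait for grant
func sendRelease(id uint64) error          { return nil } // release message
func petalRead(id uint64) ([]byte, error)  { return nil, nil }
func petalWrite(id uint64, d []byte) error { return nil }

// Rule 1: no cached data without the lock. Acquire the lock first; only
// then read the data from Petal into the cache.
func (ws *workstation) readLocked(id uint64) ([]byte, error) {
	if _, held := ws.locks[id]; !held {
		if err := requestLock(id); err != nil {
			return nil, err
		}
		data, err := petalRead(id) // strictly after the grant
		if err != nil {
			return nil, err
		}
		ws.locks[id] = &wsLockEntry{state: idle, cached: data}
	}
	return ws.locks[id].cached, nil
}

// Rule 2: before releasing a lock, write any modified data back to
// Petal; only after Petal acknowledges does the lock go back.
func (ws *workstation) release(id uint64) error {
	if e := ws.locks[id]; e != nil && e.dirty {
		if err := petalWrite(id, e.cached); err != nil {
			return err
		}
	}
	delete(ws.locks, id) // the cached copy goes away with the lock
	return sendRelease(id)
}
```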
What this results in is that the protocol between the workstations and the lock server consists of four different kinds of messages. This is the coherence protocol; you can think of them as essentially one-way network messages. There's a request message, from a workstation to the lock server, that says, hey, lock server, I'd like to get this lock. When the lock server is willing to give you the lock, and of course it can't immediately if somebody else holds it, but when the lock becomes free, the lock server responds with a grant message, from the lock server back to the workstation, in response to the earlier request.

Now, if you request a lock from the lock server and someone else holds the lock right now, that other workstation has to first give up the lock; we can't have two people owning the same lock. So how are we going to get that workstation to give up the lock? What I said before is that when a workstation is actually using the lock, actively reading or writing something, it holds the lock and has it marked busy; but workstations don't ordinarily give up their locks when they're done using them. If I create a file, then after the create system call finishes, my workstation will still own the lock for that new file; the lock will just be in state idle instead of busy, while as far as the lock server is concerned, my workstation still holds it. The reason for this, the reason to be lazy about handing locks back to the lock server, is that if I create a file called y on my workstation, I'm almost certainly about to use y for other purposes, maybe write some data to it or read from it. So it's extremely advantageous for the workstation to accumulate locks for all of its recently used files and not give them back unless it really has to. In the common case, where I use a bunch of files in my home directory and nobody on any other workstation ever looks at them, my workstation ends up accumulating dozens or hundreds of locks in idle state for my files.

But if somebody else does look at one of my files, they need to first get the lock, and I have to give it up. The way that works is that if the lock server receives a lock request and sees in its table, aha, that lock is currently owned by workstation 1, the lock server will send a revoke message to the workstation that currently owns the lock, saying, look, somebody else wants this lock, please give it up.
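The four message kinds, as the lecture names them (the paper's actual wire format may differ):

```go
// The coherence protocol's four one-way messages.
type msgKind int

const (
	msgRequest msgKind = iota // workstation -> lock server: I want this lock
	msgGrant                  // lock server -> workstation: you now hold it
	msgRevoke                 // lock server -> current holder: give it up
	msgRelease                // workstation -> lock server: I gave it up
)

type lockMsg struct {
	kind msgKind
	id   uint64 // lock identifier (an inumber in Frangipani)
	ws   string // the workstation involved
}
```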
When a workstation receives a revoke request, if the lock is idle, then if the cached data is dirty, the workstation will first write that dirty, modified data from its cache back to Petal, because the rule that you never cache data without a lock says you have to write the modified data back to Petal before releasing. So if the lock is idle, the workstation first writes back the data, if it's been modified, and then sends a message back to the lock server saying, okay, we give up this lock. So the response to a revoke sent to a workstation is that the workstation sends a release. Of course, if the workstation gets a revoke while it's actively using a lock, while it's in the middle of a delete or a rename or something that affects the locked file, the workstation will not give up the lock until it's done using it, until it's finished that file system operation, whatever system call it was that was using the file. Then the lock in the workstation's lock table will transition to idle, and at that point the workstation can pay attention to the revoke request and, after writing to Petal if need be, release the lock.

All right, so this is the coherence protocol, or rather a simplification of the coherence protocol that Frangipani uses. As I mentioned before, what's missing from all this is the fact that locks can be either exclusive, for writers, or shared, for read-only access. And just as Petal is a block server that doesn't understand anything about file systems, the lock server also doesn't know anything about files or directories or the file system; these names are really opaque lock identifiers. It just has this table of opaque IDs and who owns those locks, and it's Frangipani that knows, ah, the lock I associate with a given file has such-and-such an identifier. As it happens, Frangipani uses UNIX-style inumbers, the numbers associated with files, rather than names, for its locks.
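Putting the revoke rules together, still ignoring the write-ahead log (whose role in this sequence comes up later), a hypothetical handler on the workstation side might look like:

```go
// handleRevoke sketches a workstation's response to a revoke message,
// before logging enters the picture. A busy lock defers the release
// until the ongoing system call finishes and the lock goes idle.
func (ws *workstation) handleRevoke(id uint64) error {
	e, held := ws.locks[id]
	if !held {
		return nil // we don't hold it; nothing to give up
	}
	waitUntilIdle(e) // block until any ongoing operation completes
	// release writes dirty data back to Petal before the lock goes back,
	// preserving the no-cached-data-without-the-lock invariant.
	return ws.release(id)
}

func waitUntilIdle(e *wsLockEntry) { /* hypothetical synchronization */ }
```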
Just to make this coherence protocol concrete, and to illustrate again the relationship between Petal operations and lock server operations, let me run through what happens if one workstation modifies some file system data and then another workstation needs to look at it. We have two workstations and the lock server. The way the protocol plays out: suppose workstation 1 wants to read and then modify file z. Before it can even read anything about z from Petal, it must first acquire the lock for z, so it sends an acquire request to the lock server. Maybe nobody holds the lock, maybe the lock server has never heard anything about it, so the lock server makes a new entry for z in its table and returns a reply saying, yes, you have the grant for lock z. At this point the workstation has the lock on file z and is entitled to read information about it from Petal, so now workstation 1 reads z from Petal, and indeed it can modify z locally in its cache.

At some later point, maybe the human being sitting in front of workstation 2 also wants to read file z. Workstation 2 doesn't have the lock for z, so the very first thing it needs to do is send a message to the lock server saying, I'd like to get the lock for file z. The lock server knows it can't reply yes yet, because somebody else, namely workstation 1, has the lock; so the lock server sends a revoke to workstation 1. Workstation 1 is not allowed to give up the lock until it writes any modified data back to Petal, so it now writes any modified content of the file back to Petal. Only then is workstation 1 allowed to send a release back to the lock server. The lock server must have kept a record in some table saying, there's somebody waiting for lock z; as soon as its current holder releases it, we need to reply. So receipt of this release causes the lock server to update its tables and finally send the grant to workstation 2, and at this point workstation 2 can finally read z from Petal. This is how the cache coherence protocol plays out to ensure that nobody reads the data until anybody who might have had it modified privately in their cache first writes the data back to Petal. The locking machinery forces reads to see the latest write.

There are a number of optimizations possible in these kinds of cache coherence protocols. I've actually already described one: the idle state, the fact that workstations hold onto locks they're not using right now instead of immediately releasing them, is already an optimization over the simplest protocol you could think of. The other main optimization Frangipani has is the notion of shared read locks versus exclusive write locks. If lots and lots of workstations need to read the same file but nobody's writing it, they can all hold a read lock on that file; and if somebody does come along and try to write this widely cached file, the lock server first needs to revoke everybody's read lock, so that everybody gives up their cached copy, and only then is the writer allowed to write the file. That's okay, because nobody has a cached copy anymore, so nobody can read stale data while it's being written. All right, so that's the cache coherence story, driven by the locking protocol.

Next up on our list is... yes, that's a good question. In fact, there's a risk in the scheme I described: if I modify a file on my workstation and nobody else ever reads it, the only copy of the modified file, which may have some precious information in it, is in the cache, in RAM, on my workstation. If my workstation were to crash, and we hadn't done anything special, it would have crashed with the only copy of the data, and the data would be lost. In order to forestall this, no matter what, all these workstations write back anything modified in their cache every 30 seconds, so that if my workstation crashes unexpectedly, I may lose the last 30 seconds of work, but no more. This actually just mimics the way ordinary Linux or UNIX works.
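That periodic flush might look like this sketch; the 30-second figure is from the lecture, everything else here is invented.

```go
import "time"

// periodicFlush bounds how much work a crash can lose: every 30 seconds,
// write every dirty cached block back to Petal. (In real Frangipani the
// relevant log entries must reach Petal before the blocks they describe;
// that ordering is covered below.)
func (ws *workstation) periodicFlush(disk VirtualDisk) {
	for {
		time.Sleep(30 * time.Second)
		for id, e := range ws.locks {
			if e.dirty {
				if err := disk.Write(id, e.cached); err == nil {
					e.dirty = false
				}
			}
		}
	}
}
```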
Indeed, a lot of this story is about, in the context of a distributed file system, trying to mimic the properties that ordinary UNIX-style workstations have, so that users won't be surprised by Frangipani: it just works much the same way they're already used to.

All right, so our next challenge is atomicity: how to make it so that even when I do a complex operation like creating a file, which after all involves marking a new inode as allocated, initializing the inode (the inode is the little piece of data that describes each file), maybe allocating space for the file, and adding a new name in the directory for my new file, nobody sees any of the intermediate steps. There are many things that have to be updated, and we want other workstations to see the file either not exist or completely exist, but not something in between: we want atomic multi-step operations.

All right, so in order to implement this, in order to make multi-step operations like file create or rename or delete atomic as far as other workstations are concerned, Frangipani implements a notion of transactions, a complete database-style transaction system inside it, again driven by the locks. Furthermore, it's actually a distributed transaction system, and we'll hear more about distributed transaction systems later in the course; they're a very common requirement in distributed systems. The basic story is that Frangipani makes it so that other workstations can't see my modifications until an operation is completely done, by first acquiring all the locks on all the data that I'm going to need to read or write during my operation, and not releasing any of those locks until it's finished with the complete operation, and of course, following the coherence rule, written all of the modified data back to Petal. So before I do an operation like renaming, moving a file from one directory to another, which after all modifies both directories, and during which I don't want anybody to see the file in neither directory, or in both, Frangipani first acquires all the locks for the operation, then does everything, all the updates, then writes to Petal, and then releases. And this is easy: since we already had the lock server anyway in order to drive the cache coherence protocol, just by making sure we hold all the locks for the entire duration of an operation, we get these indivisible atomic transactions almost for free. That's basically all there is to say about making operations atomic: Frangipani just holds all the locks.
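For instance, a rename between two directories might look like this sketch (the directory-editing helpers are hypothetical). Note that at the end the locks merely go idle; they aren't handed back.

```go
// rename sketches an atomic multi-step operation: acquire every lock the
// operation touches, do all the updates in the cache, and only then let
// the locks go idle. No other workstation can observe the state where
// the name is in neither directory (or both).
func (ws *workstation) rename(srcDir, dstDir uint64, name string) error {
	// 1. Acquire all the locks before reading or writing anything.
	for _, id := range []uint64{srcDir, dstDir} {
		if _, err := ws.readLocked(id); err != nil {
			return err
		}
		ws.locks[id].state = busy
	}
	// 2. Do every update, purely in the local cache.
	removeName(ws.locks[srcDir], name)
	addName(ws.locks[dstDir], name)
	// 3. Only now do the locks go idle; if the lock server revokes them,
	// the modified blocks are written back to Petal before release.
	ws.locks[srcDir].state = idle
	ws.locks[dstDir].state = idle
	return nil
}

// Hypothetical helpers that edit a cached directory block and mark it dirty.
func removeName(e *wsLockEntry, name string) { e.dirty = true }
func addName(e *wsLockEntry, name string)    { e.dirty = true }
```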
An interesting thing about this use of locks is that Frangipani is using locks for two almost opposite purposes. For cache coherence, Frangipani uses the locks to make sure that writes are visible immediately to anybody who wants to read them; that use is essentially about making sure people can see writes. This use of the locks is all about making sure people don't see the writes until I'm finished with an operation, because I hold all the locks until all the writes have been done. So they're playing an interesting trick here, reusing the locks they would have had to have anyway for transactions in order to drive cache coherence.

All right, so the next interesting thing is crash recovery. We need to cope with several possibilities, and the most interesting is that a workstation crashes while holding locks, while in the middle of some complex set of updates. That is, the workstation acquired a bunch of locks, it's writing a whole lot of data, maybe to create or delete files, and it has possibly written some of those modifications back to Petal, maybe because it was soon going to release locks, or had been asked by the lock server to release locks; so it has done some of the writes back to Petal for its complex operation, but not all of them, and then it crashes before giving up the locks. That's the interesting situation for crash recovery.

There are a number of approaches that don't work very well for workstation crashes. One thing that doesn't work is to just observe that the workstation crashed and release all its locks. If it had done something like create a new file, and it had written the file's directory entry, its name, back to Petal, but it hadn't yet written the initialized inode that describes the file, then the inode in Petal may still be filled with garbage, or some previous file's information, and yet we've already written the directory entry. So it's not okay to just release a crashed workstation's locks. Another thing that's not okay is to never release the crashed workstation's locks. That would be correct in a sense: if it crashed while in the middle of writing out some of its modifications, the fact that it hadn't written out all of them means it can't release its locks, and simply not releasing them would hide the partial update from any readers, so nobody would ever be confused by seeing partially updated data structures in Petal. On the other hand, anybody who needed to use those files would then have to wait forever for the locks. We absolutely have to give up the locks so that other workstations can keep using the same files and directories, but we have to do something about the fact that the workstation might have done some of the writes for its operations but not all.

So Frangipani, like almost every other system that needs to implement crash-recoverable transactions, uses write-ahead logging. This is something we've seen at least one instance of: last lecture, Aurora was also using write-ahead logging.
The idea is that if a workstation needs to do a complex operation that involves updating many pieces of data in Petal, in the file system, the workstation will first, before it makes any writes to Petal, append a log entry to its log in Petal describing the full set of writes it's about to do, and only when that log entry describing the full set of updates is safely in Petal, where anybody else can see it, will the workstation start to send the writes for the operation out to Petal. So before a workstation could ever reveal even one of its writes for an operation to Petal, the log entry describing the whole operation, all of the updates, must already exist in Petal. This is very standard; this is just a description of write-ahead logging.

But there are a couple of odd aspects to how Frangipani implements write-ahead logging. The first is that in most transaction systems there's just one log, and all the transactions in the system sit there in one log, in one place. So if there's a crash, and there's more than one operation that affects the same piece of data, we have all of those operations for that piece of data, and everything else, right there in the single log sequence, and we know, for example, which is the most recent update to a given piece of data. Frangipani doesn't do that: it has one log per workstation, separate logs. The other very interesting thing about Frangipani's logging system is that the workstation logs are stored in Petal, not on local disk. In almost every system that uses logging, the log is tightly associated with whatever computer is running the transactions, and it's almost always kept on a local disk. But for extremely good reasons, Frangipani workstations store their logs in Petal, in the shared storage: each workstation has its own sort of semi-private log, but it's stored in Petal, where, if the workstation crashes, its log can be gotten at by other workstations. So the logs are in Petal: separate logs per workstation, stored somewhere else, in public, shared storage. A very interesting and unusual arrangement.

All right, so we need to know roughly what's in a log entry. Unfortunately the paper is not super explicit about the format of a log entry, but the paper does say that each workstation's log sits in a known place, a known range of block numbers, in Petal, and furthermore that each workstation uses its log space in Petal in a circular way: it writes log entries along from the beginning, and when it hits the end, the workstation goes back and reuses its log space from the beginning of its log area. Of course, that means workstations need to be able to clean their logs, to ensure that a log entry isn't needed anymore before its space is reused; I'll talk about that in a bit.
Each log consists of a sequence of log entries. Each log entry has a log sequence number, just an increasing number; each workstation numbers its log entries 1, 2, 3, 4, 5. The immediate reason for this, maybe the only reason the paper mentions, is that the way Frangipani detects the end of a workstation's log, if the workstation crashes, is by scanning forward in its log in Petal until it sees the sequence numbers stop increasing; it knows then that the log entry with the highest log sequence number must be the very last entry. It needs to be able to detect the end of the log. So we have this log sequence number, and then I believe each log entry has an array of descriptions of modifications: all the different modifications that were involved in a particular operation, one file system system call. Each entry in the array has a block number, which is a block number in Petal; a version number, which we'll get to in a bit; and then the data to be written. There's a bunch of these, because they're required to describe operations that might touch more than one piece of data in the file system.

One thing to notice is that the log only contains information about changes to metadata, that is, to directories and inodes and allocation bitmaps in the file system. The log doesn't contain the data that's written to the contents of files, the user's data; it contains just enough information to make the file system's structures recoverable after a crash. So, for example, if I create a file called f in a directory, that's going to result in a new log entry that has two little descriptions of modifications in it: one describing how to initialize the new file's inode, and another describing the new name to be placed in the new file's directory.
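A guess at the shape of a log entry, following the lecture's description (the paper isn't fully explicit about the format):

```go
// One logged write: which Petal block, the new version number for that
// piece of metadata, and the bytes to put there. Only metadata is
// logged, never file contents.
type update struct {
	block   uint64
	version uint64
	data    []byte
}

// A LogEntry groups all the writes of one operation, so recovery can do
// all of them or none. A real entry would also need a checksum, so that
// a partially written entry is never replayed.
type LogEntry struct {
	lsn     uint64 // log sequence number: 1, 2, 3, ... per workstation
	updates []update
}
```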
updated blocks that are covered by 55:16 the lock that's being revoked so write 55:21 modified blocks 55:28 just for that provoked to lock and then 55:40 send a release message and the reason 55:48 for this sequencing and for this strict 55:50 ban is that these modifications if we 55:54 write them to peddle you know their 55:55 modifications to the data structure the 55:58 file system data structure and if we 56:00 were to crash midway through baby news 56:01 box just as usual we want to make sure 56:04 that some other workstation somebody 56:07 else there's enough information to be 56:09 able to complete the set of 56:12 modifications that the were station is 56:14 made even though the workstation has 56:16 crashed and maybe didn't finish doing 56:17 these rights and writing the log first 56:20 it's gonna be what allows us to 56:22 accomplish it these these log records 56:24 are a complete description of what these 56:26 modifications are going to be so first 56:28 we you know first we write though the 56:30 complete log to petal and then we 56:33 workstation can start writing its 56:35 modified blocks you know maybe it 56:37 crashes maybe doesn't hopefully not and 56:39 if it finishes writing as modified 56:41 blocks then it could send the release 56:43 back to the lock server so you know if 56:45 my workstation has modified a bunch of 56:46 files and then some other workstation 56:48 wants to read one of those files this is 56:50 the sequence that happens lock so ever 56:52 asked me for my locks right back my 56:54 workstation right back said log then 56:56 right back 56:58 writes the dirty modified blocks to 57:01 peddle and only then releases and then 57:03 the other workstation can acquire the 57:04 lock and read these blocks so that's 57:06 sort of the non crash you know if a 57:09 crash doesn't happen that is the 57:13 sequence of course it's only interesting 57:17 if a crash happens yes 57:21 [Music] 57:35 okay so for the log you're absolutely 57:38 right it writes the entire log and yeah 57:42 so so if if we get a revoke for a 57:44 particular file the workstation will 57:47 write its entire log and then only it's 57:53 only because it's only giving up the 57:54 lock for Z it it only needs to write 57:59 back data that's covered by Z so I have 58:01 to write the whole log just the data 58:03 that's covered by the lock that we 58:05 needed to give up and then we can 58:07 release that lock so yeah you know maybe 58:10 this writing the whole log might be 58:11 overkill like you if it turned out you 58:13 know so here's an optimization that you 58:15 might or might not care about if the 58:18 last modification for profile Z for the 58:21 lock were giving up is this one but 58:22 subsequent entries in my log didn't 58:25 modify that file then I could just write 58:27 just this prefix of my in-memory log 58:30 back to petal and you know be lazy about 58:33 writing the rest and that might see me 58:36 sometime 58:37 I might have to write the log back it's 58:41 actually not clear I would save us a lot 58:42 of time we have to write the log back at 58:43 some point anyway and yeah I think petal 58:47 just writes the whole thing okay okay so 58:53 now we can talk about what happens when 58:56 a workstation crashes while holding 58:58 locks right it's you know needs to 59:01 modify something rename a file create a 59:03 file whatever it's acquired all the 59:05 locks it needs it's modified some stuff 59:07 in its own cache to reflect these 59:13 operations maybe written some stuff back 
Okay, so now we can talk about what happens when a workstation crashes while holding locks. It needed to modify something, rename a file, create a file, whatever; it acquired all the locks it needed; it modified some stuff in its own cache to reflect those operations; maybe it wrote some stuff back to Petal; and then it crashed, possibly midway through writing. So there are a number of points at which it could crash. But because the sequence is always log first, always before writing modified blocks from the cache back to Petal, Frangipani will always have written its log to Petal first. That means that if a crash happens, it's either while the workstation is writing its log back to Petal, but before it has written any modified file or directory blocks; or while it's writing those modified blocks back, but therefore definitely after it has written its entire log; or maybe the crash happened after it completely finished all of this. So, because of the sequencing, there's only a limited number of scenarios we need to worry about for the crash.

Okay, so the workstation has crashed, and, to be exciting, let's say it crashed while holding locks. The first thing that happens is that the lock server sends it a revoke request and gets no response. That's what triggers anything: if nobody ever asks for the lock, basically nobody's ever going to notice that the workstation crashed. So let's assume somebody else wanted one of the locks the workstation held when it crashed. The lock server sends a revoke, and it will never get a release back from the workstation. After a certain amount of time has passed, and it turns out Frangipani locks use leases, for a number of reasons, so after the lease time has expired, the lock server will decide that the workstation must have crashed, and it will initiate recovery. What that really means is telling a different workstation: the lock server will tell some other live workstation, look, workstation 1 seems to have crashed; please go read its log and replay all of its recent operations to make sure they're complete, and tell me when you're done. Only then is the lock server going to release the locks. And this is the point at which it was critical that the logs are in Petal, because some other workstation is going to inspect the crashed workstation's log in Petal.

All right, so what are the possibilities? One is that the workstation crashed before it ever wrote anything back. That means the other workstation doing recovery will look at the crashed workstation's log, see that there's maybe nothing in it at all, do nothing, and then release the locks the workstation held. Now, the crashed workstation may have modified all kinds of things in its cache, but if it didn't write anything to its log area, then it couldn't possibly have written any of the blocks it modified during those operations. So while we will have lost the last few operations the workstation did, the file system is going to be consistent as of a point in time before the crashed workstation started to modify anything, because apparently the workstation never even got to the point where it was writing log entries.
The next possibility is that the workstation wrote some log entries to its log area. In that case, the recovering workstation will scan forward from the beginning of the log until it stops seeing the log sequence numbers increasing; that's the point where the log must end. The recovering workstation will look at each of these descriptions of a change and basically play that change back into Petal: it'll say, ah, this certain block number in Petal needs to have this certain data written to it, which is just the same modification the crashed workstation did in its own local cache. So the recovering workstation will consider each of these and replay each of the crashed workstation's log entries back into Petal, and when it's done that, all the way to the end of the crashed workstation's log as it exists in Petal, it'll tell the lock server, and the lock server will release the crashed workstation's locks. That brings Petal up to date with some prefix of the operations the crashed workstation had done before crashing, maybe not all of them, because maybe it didn't write out all of its log; but the recovering workstation won't replay anything in a log entry unless it has the complete log entry in Petal. Implicitly, that means there's got to be some sort of checksum arrangement, so the recovering workstation will know, aha, this log entry is complete, not partially written. That's quite important, because the whole point of this is to make sure that only complete operations are visible in Petal, never, never, never a partial operation. It's also important that all the writes for a given operation are grouped together in the log, so that on recovery the recovering workstation can do all of the writes for an operation, or none of them, never half of them.

Okay, so that's what happens if the crash occurs while the log is being written back to Petal. Another interesting possibility is that the workstation crashed after writing its log and also after writing some of the blocks back itself, and then crashed. Then, skimming over some extremely important details, which I'll get to in a moment, what will happen is again that the recovering workstation, which of course doesn't really know the point at which the workstation crashed, all it sees is, oh, here are some log entries, will replay the log in the same way. More or less what's going on is that even if the modifications were already done in Petal, the recovering workstation replays the same modifications: it just writes the same data to the same place again, presumably not really changing the value for the writes that had already been completed. But if the crashed workstation hadn't done some of its writes, then some of these replayed writes will actually change the data and complete the operations.
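The replay pass might look like this sketch, ignoring for the moment the version-number check described next; entryComplete stands in for the checksum test.

```go
// replayLog is run by a live workstation over a crashed workstation's
// log as read from Petal. It stops where the sequence numbers stop
// increasing (the end of the log) and replays each complete entry.
func replayLog(entries []LogEntry, disk VirtualDisk) error {
	var last uint64
	for _, e := range entries {
		if e.lsn <= last {
			break // sequence numbers stopped increasing: end of log
		}
		last = e.lsn
		if !entryComplete(e) {
			break // never replay a partially written entry
		}
		for _, u := range e.updates {
			// Rewrite the same data the crashed workstation held in its
			// cache; harmless if the block was already written back.
			if err := disk.Write(u.block, u.data); err != nil {
				return err
			}
		}
	}
	return nil
}

func entryComplete(e LogEntry) bool { return true } // checksum test, hypothetical
```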
And today's question sets up 66:12 a particular scenario for which a little 66:14 bit of added complexity is necessary, in 66:22 particular the possibility that the 66:24 crashed workstation had actually gotten 66:27 through this entire sequence before 66:29 crashing, and in fact released some of 66:31 its locks, so that it wasn't the last 66:37 workstation to modify a 66:40 particular piece of data. An example 66:42 of this: what happens if we have some 66:44 workstation and it executes, say, a delete -- 66:50 it deletes a file, say a file F in 66:57 directory D -- and then there's some other 67:03 workstation which, after this delete, 67:07 creates a new file with the same name, 67:09 but of course it's a different file now. 67:12 So 67:15 workstation two later creates a 67:22 file with the same name, and then after that 67:27 workstation one crashes. So we're going to 67:32 need somebody to do recovery on workstation 67:34 one's log, and at this point in time 67:38 maybe there's a third 67:39 workstation doing the recovery. 67:45 So now workstation three is doing recovery 67:52 on workstation one's log. So the sequence 67:56 is: workstation one deleted a file, workstation 67:58 two created a file, workstation three 68:00 does recovery. Well, it could be 68:04 that this delete is still in workstation 68:06 one's log. So when 68:09 workstation one crashes, workstation 68:11 three is going to look at its log 68:13 and replay 68:14 all the updates in workstation one's log. 68:19 The updates for this 68:21 delete -- the entry for this delete -- may 68:23 still be in workstation one's log, so 68:25 unless we do something clever, 68:27 workstation three is going to delete this 68:29 file, because the delete 68:32 operation erased the relevant entry from 68:34 the directory, 68:35 thus actually deleting the file. 68:40 But it's a different file, one that 68:41 workstation two created afterwards, so 68:44 that's completely wrong. What we 68:46 want, the outcome we want, is: 68:48 workstation one deleted a 68:50 file, so that file should be deleted, but a 68:52 new file with the same name should not be 68:55 deleted just because of a crash and a 68:56 restart, because the create happened after the 68:58 delete. All right, so we cannot just 69:01 replay workstation one's log without 69:05 further thought, because 69:09 a log entry in workstation 69:11 one's log may be out of date by the time 69:13 it's replayed during recovery; some 69:16 other workstation may have modified the 69:17 same data in some other way 69:19 subsequently. So we can't blindly replay 69:22 the log entries. And this is 69:26 today's question, and the way 69:29 Frangipani solves this is by associating 69:32 version numbers with every piece of data 69:36 in the file system as stored in Petal, 69:39 and also associating the same version 69:42 number with every update that's 69:45 described in the log. So 69:49 first, 69:53 in Petal: 69:59 every piece of metadata -- every 70:02 inode, every piece of data that's 70:06 like the contents of a directory, for 70:08 example -- every block of metadata 70:12 stored in Petal has a version number. 70:14 When a workstation needs to modify a 70:19 piece of metadata in Petal, it first 70:21 reads that metadata from Petal into its 70:23 memory and then looks at the existing 70:27 version number, 70:28 and then, when it's creating the log entry 70:30 describing its modification, it puts the 70:32 existing version number plus one into 70:36 the log entry; and then, if it 70:41 does get a chance to write the data back, 70:43 it'll write the data back with the new, 70:45 increased version number.
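A small Go sketch of that stamping step, with invented names, might look like the following: with the lock held, the workstation reads the current block and records version-plus-one both in the log entry and, later, in the written-back block.

// A minimal sketch (invented names) of stamping a logged update with
// the metadata block's version number plus one.
package main

type metaBlock struct {
	Version uint64 // every metadata block in Petal carries a version
	Data    []byte
}

type update struct {
	BlockNum   uint64
	NewVersion uint64 // old version in Petal plus one
	NewData    []byte
}

// prepareUpdate is what a workstation does, with the lock held, when it
// modifies (say) a directory block: read the current block, and record
// version+1 in the log entry describing the change.
func prepareUpdate(read func(bn uint64) metaBlock, bn uint64, newData []byte) update {
	cur := read(bn)
	return update{BlockNum: bn, NewVersion: cur.Version + 1, NewData: newData}
}

func main() {}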
So if a workstation 70:48 hasn't crashed, or if 70:51 it did manage to write some data back 70:52 before it crashed, then the version 70:55 number stored in Petal for the 70:56 affected metadata will be at least as 70:59 high as the version number 71:02 stored in the log entry; it will be 71:04 higher if some other workstation 71:05 subsequently modified it. So what will 71:09 actually happen here is that what 71:13 workstation three will see is that the log 71:17 entry for workstation one's delete 71:20 operation will have a particular version 71:23 number stored in the log entry, 71:26 associated with the modification to the 71:28 directory, let's say. The log entry 71:31 will say, well, the 71:33 new version number for the directory 71:35 created by this log entry is version 71:37 number three. In order for workstation 71:40 two to subsequently change the directory -- 71:42 that is, to add a file -- in fact, before 71:45 it crashed, workstation one must have 71:47 given up the lock on the directory, and 71:49 that's probably why the log entry even 71:52 exists in Petal. So workstation one must 71:55 have given up the lock; apparently 71:56 workstation two got the lock, read 71:58 the current metadata for the directory, 72:02 and saw that the version number was now 72:04 three, and when workstation two writes this 72:08 data it will set the version number of 72:14 the directory in Petal to be four. OK, so 72:19 that means the log entry for this 72:22 delete operation is going to have 72:24 version number three in it. Now when the 72:28 recovery software on workstation three 72:31 replays workstation one's log, it looks at 72:34 the version numbers first. So it'll look 72:36 at the version number in the log entry, 72:37 it'll read the block from Petal and 72:40 look at the version number in the block, 72:42 and if the version number in the block 72:44 in Petal is greater than or equal to the 72:47 version number in the log entry, the 72:50 recovery software will simply ignore 72:51 that update in the log entry and not do 72:54 it, because clearly the block had already 72:57 been written back by the crashed 72:59 workstation and then maybe subsequently 73:01 modified by other workstations. So the 73:05 replay is actually selective, based on 73:06 this version number: 73:08 recovery only 73:14 replays a write from the log if that 73:17 write -- the write in the log 73:20 entry -- is newer than the data that's 73:22 already stored in Petal.
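Here's a minimal Go sketch of that version-checked replay. The types are invented, but the comparison is the one just described: replay a logged write only if its version is strictly newer than what Petal already holds.

// A minimal sketch (invented types) of version-checked log replay.
package main

type block struct {
	Version uint64
	Data    []byte
}

// store stands in for Petal.
type store interface {
	Read(bn uint64) block
	Write(bn uint64, b block)
}

type loggedWrite struct {
	BlockNum uint64
	Version  uint64 // old version + 1, recorded when the entry was logged
	Data     []byte
}

func replay(s store, w loggedWrite) {
	cur := s.Read(w.BlockNum)
	if cur.Version >= w.Version {
		// Petal already holds this write, or something newer, e.g.
		// workstation two's re-created file. Replaying would clobber
		// newer data, so skip it.
		return
	}
	s.Write(w.BlockNum, block{Version: w.Version, Data: w.Data})
}

func main() {}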
So one sort of 73:31 irritating question here, maybe, is that 73:34 workstation three is running this 73:37 recovery software while other 73:39 workstations are still actively reading and 73:41 writing the file system 73:42 and hold locks and are doing I/O to Petal. 73:46 So the replay is going to go on while 73:50 workstation two -- which 73:52 doesn't know anything about the recovery -- is still active, 73:54 and indeed workstation two may hold the 73:57 lock for this directory 74:00 while recovery is going on. So recovery 74:03 may be scanning the log and need 74:05 to read or write this directory's data 74:08 in Petal while workstation two still has 74:11 the lock on this data. The question is: 74:14 how do we sort this out? 74:16 One possibility, which actually 74:19 turns out not to work, is for the 74:22 recovery software to first acquire the 74:24 lock on anything that it needs to look 74:28 at in Petal while it's replaying 74:30 the log. 74:36 One good 74:36 reason why that doesn't work is that it 74:38 could be that we're running recovery 74:39 after a system-wide power failure, for 74:41 example, in which all knowledge of who 74:43 held which locks is lost, and therefore we 74:46 cannot write the recovery software to 74:49 participate in the locking 74:52 protocol, because 74:54 all knowledge of what's locked 74:56 and what's not locked may have been lost in 74:57 the power failure. 74:58 But luckily, it turns out that the 75:01 recovery software can just go ahead and 75:03 read or write blocks in Petal -- 75:07 sorry, read or write data 75:09 in Petal -- without worrying at all about 75:11 locks. The reason is that if the 75:15 recovery software 75:16 wants to replay this log entry 75:18 and possibly modify the data associated 75:20 with this directory, it just goes ahead 75:22 and reads whatever's there for the 75:23 directory out of Petal right now, and 75:26 there are really only two cases: either the 75:28 crashed workstation one had given up its 75:30 lock or it hadn't. If it hadn't given up 75:33 the lock, then nobody else can have the 75:35 directory locked, and so there's no 75:36 problem. If it had given up its lock, then 75:39 before it gave up the lock it must 75:42 have written its data for the 75:46 directory back to Petal, and that means 75:50 that the version number stored in Petal 75:52 must be at least as high as the version 75:53 number in the crashed workstation's log 75:56 entry, and therefore when the recovery 75:58 software compares the log entry's version 76:00 number with the version number of the 76:02 data in Petal, it'll see that the log 76:04 entry's version number is not higher and 76:07 therefore won't replay the log entry. So, 76:11 yes, the recovery software will have 76:13 read the block without holding the lock, 76:15 but it's not going to modify it, because 76:18 if the lock was released, the version 76:19 number will be high enough to show that 76:21 the log entry had already been 76:26 applied to Petal before the crashed 76:28 workstation crashed. So there's no 76:31 locking issue. All right, 76:41 I've now gone over the main guts of what 76:43 Frangipani is up to: its cache coherence, 76:46 its distributed transactions, and its 76:49 distributed crash recovery. The other 76:53 thing to think about: the paper 76:55 talks a bit about performance. It's 76:57 actually very hard, after over 20 years, 77:00 to interpret performance numbers, because 77:02 they ran their performance numbers on 77:04 very different hardware in a very 77:06 different environment from what 77:08 you see today. Roughly speaking, the 77:11 performance numbers they show are that as 77:12 you add more and more Frangipani 77:15 workstations, the system basically 77:19 doesn't get slower; that is, each new 77:22 workstation, even if it's actively doing 77:24 file system operations, doesn't slow down 77:26 the existing workstations. So in that 77:28 sense -- at least for 77:30 the applications they looked at -- the system 77:32 was giving them reasonable scalability: 77:34 they could add more workstations without 77:36 slowing existing users down. Looking 77:42 backwards, 77:44 although Frangipani is full of very 77:47 interesting
techniques that are worth 77:49 remembering, it didn't have much 77:51 influence on the evolution of 77:55 storage systems. Part of the reason is 77:58 that the environment it's aimed 78:00 at -- small workgroups, 78:02 people sitting in front of workstations 78:04 on their desks and sharing files -- 78:06 while it still exists in some 78:09 places, isn't really where the action is 78:12 in distributed storage. The 78:13 real action has moved into 78:15 big data centers, big websites, big data 78:19 computations. In that 78:22 world, first of all, the file system 78:24 interface just isn't very useful 78:25 compared to databases: people really 78:28 do want transactions in the big-website 78:31 world, but they need them for very small 78:33 items of data, the kind of data that you 78:35 would store in a database rather than 78:39 the kind of data that you would 78:40 naturally store in a file system. So 78:44 you can see echoes of 78:47 some of this technology in modern 78:49 systems, but it usually takes the form of 78:50 some database. The other big kind of 78:53 storage that's out there is storage of big 78:56 files as needed for big data 78:59 computations like MapReduce, and indeed 79:01 GFS, to some extent, looks 79:04 like a file system and is the kind of 79:06 storage system you want for MapReduce. 79:08 But for GFS and for big data 79:12 computations, Frangipani's 79:15 focus on local caching in workstations 79:19 and very close attention to 79:22 cache coherence and locking is just 79:24 not very useful: for both the 79:27 data reads and writes, 79:29 caching is typically not useful at all. 79:33 If you're reading through ten 79:35 terabytes of data, it's almost 79:38 counterproductive to cache it. So 79:41 a lot of the focus of Frangipani -- 79:45 time has passed it by a little bit. It's 79:47 still useful in some situations, but it's 79:50 not what people are really thinking 79:52 about in designing new systems. All 79:56 right, that is it.