Today the paper I'm going to discuss is Frangipani. This is a fairly old distributed file system paper; the reason we're reading it is that it has a lot of interesting and good design having to do with cache coherence, distributed transactions, and distributed crash recovery, as well as the interactions between them. Those are really the ideas we're going to try to tease out.

Cache coherence is the idea that if I have something cached, and you modify it, something will nevertheless happen so that, despite my cache, I can see your modifications. We also have distributed transactions, which file systems need internally to be able to make complex updates to the file system data structures. And because the file system is essentially split up among a bunch of servers, it's critical to be able to recover from crashes of those servers.

The overall design: Frangipani is a network file system. It's intended to work with existing applications, ordinary UNIX programs running on people's workstations, much like Athena's AFS lets you get at your Athena home directory and various project directories from any Athena workstation. So the overall picture is that you have a bunch of users, and each user, in the paper's world, is sitting in front of a workstation, which in those days was not a laptop but a real computer with a keyboard, display, mouse, and window system. I'll call the workstations workstation 1, workstation 2, and so on. Each workstation runs an instance of the Frangipani server, and almost all of the interesting stuff in this paper goes on in the Frangipani software in each workstation. A user might be running ordinary programs like a text editor that reads and writes files, and maybe when they finish editing a source file they run it through the compiler, which reads that source file. When these ordinary programs make file system calls, there's a Frangipani module inside the kernel of each of these workstations that implements the file system.

The real storage of the file system data structures, certainly file contents, but also inodes, directories, the list of files in each directory, and the information about which inodes and which blocks are free, is all stored in a shared virtual disk service called Petal. Petal runs on a separate set of machines, probably server machines in a machine room rather than workstations on people's desks. Among many other things, Petal replicates data, so you can think of Petal servers as coming in pairs: if one crashes, we can still get at our data.

When Frangipani needs to read a directory or something, it sends a remote procedure call to the correct Petal server saying, here's the block that I need, please read it and return it. For the most part Petal acts like a disk drive; you can think of it as a shared disk drive that all these Frangipanis talk to, which is why it's called a virtual disk. From our point of view, for most of this discussion, we're just going to imagine Petal as a disk drive used over the network by all these Frangipanis: you read and write it by giving it a block number, an address on the disk, and saying I'd like to read that block, just like an ordinary hard drive.
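Since everything that follows goes through Petal, it may help to pin down what this block interface amounts to. Here's a minimal Go sketch with invented names; the paper treats Petal simply as a replicated virtual disk addressed by block number, so this is only an illustration of that idea, not Petal's actual API.

```go
// VirtualDisk is roughly the view a Frangipani workstation has of Petal:
// a big array of blocks reached by RPC, with replication and the storage
// system's own fault tolerance hidden behind the interface.
type VirtualDisk interface {
	// Read returns the contents of the block at the given block number.
	Read(blockNum uint64) ([]byte, error)
	// Write replaces the contents of the block at the given block number.
	Write(blockNum uint64, data []byte) error
}
```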
Okay, so the intended use for this file system, the use the authors intended, is actually a reasonably important driver of the design. What they wanted was to support their own activities: they were members of a research lab of maybe 50 people, and they were used to shared infrastructure, things like time-sharing machines, or workstations using previous network file systems to share files among cooperating groups of researchers. They wanted a file system they could use to store their own home directories as well as shared project files. That meant that if I edit a file, I'd really like the other people I work with to be able to read the file I just edited; we want that kind of sharing. In addition, it's great if I can sit down at any workstation, my workstation, your workstation, a public workstation in the library, and still get at all the files in my home directory, everything I need in my environment. So they were really interested in a shared file system for human users in a relatively small organization, small enough that everybody was trusted, all the people and all the computers. The design has essentially nothing to say about security, and indeed it arguably would not work in an environment like Athena, where you can't really trust the users or the workstations. It's very much designed for their environment.

As far as performance, their environment was also important. It turns out that the way most people use computers, at least the workstations they sit in front of, is that they mostly read and write their own files. They may read some shared files, programs or project files or something, but most of the time I'm reading and writing my files, and you're reading and writing your files on your workstation; it's really the exception that we're actively sharing files. So it makes a huge amount of sense, even though officially the real copies of files are stored in this shared disk, to have some kind of caching, so that after I log in and use my files for a while, they're cached locally and can be gotten at in microseconds instead of the milliseconds it would take to fetch them from the file servers.
Okay, so Frangipani supported this kind of caching, and furthermore it supported write-back caching in each workstation's Frangipani server. Write-back caching means that if I modify a file, or even create a file in a directory, or delete a file, or do basically any other operation, then as long as no other workstation needs to see it, my writes stay purely local in the cache. If I create a file, then at least initially all the information about the newly created file, a newly allocated inode with initialized contents, a new name added to my home directory, all those modifications are done just in the cache. Therefore things like creating a file can be done extremely rapidly: they just require modifying local memory in this machine's cache, and in general they're not written back to Petal until later. So at least initially we can do all kinds of modifications to the file system, at least to my own directories and my own files, completely locally, and that's enormously helpful for performance; it's a factor of maybe a thousand between modifying something in local memory and having to send a remote procedure call to a server.

One serious consequence of that, and it's extremely determinative of the architecture here, is that the logic of the file system has to be in each workstation. In order for my workstation to implement things like creating a file purely out of its local cache, all the logic, all the intelligence for the file system, has to be sitting in my workstation. In their design, to a first approximation, the Petal shared storage system knows absolutely nothing about file systems or files or directories; Petal is in a sense a very straightforward, simple system, and all the complexity is in the Frangipani code in each client. So it's a very decentralized scheme, and one of the reasons is that this was the design they could think of that allowed them to do modifications purely locally in each workstation.

It does have a nice side effect, though: since most of the complexity, and most of the CPU time spent, is in the workstations, as you add workstations and users to the system, you automatically get more CPU capacity to run those new users' file system operations, because most file system operations happen locally in the workstation. So the system has a certain degree of natural scalability: each new workstation is a bit more load from a new user, but it's also a bit more available CPU time to run that user's file system operations. Of course, at some point you're going to run out of gas in the central storage system, and then you may need to add more storage servers.
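As a rough sketch of what write-back caching implies for the data structures (the names here are hypothetical, not from the paper): a file create only has to touch this kind of in-memory state, which is why it's so fast.

```go
// cacheEntry is one locally cached copy of a file-system block; dirty
// means it has been modified locally and not yet written back to Petal.
type cacheEntry struct {
	data  []byte
	dirty bool
}

// blockCache maps Petal block numbers to locally cached copies.
type blockCache map[uint64]*cacheEntry

// modifyLocally records a purely local modification: only RAM is
// touched. The dirty blocks reach Petal later, when another workstation
// needs to see them or on a periodic flush.
func (c blockCache) modifyLocally(blockNum uint64, data []byte) {
	c[blockNum] = &cacheEntry{data: data, dirty: true}
}
```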
Okay. So we have a system that does serious caching, and furthermore does its modifications in the cache, and that leads immediately to some serious challenges in the design; the design is mostly about solving the challenges I'm about to lay out. These are largely challenges that come from caching and from this decentralized architecture where most of the intelligence sits in the clients.

The first challenge: suppose workstation 1 creates a file, say a new file /a, and initially it just creates this in its local cache. First it may need to fetch the current contents of the / directory from Petal, but then when it creates the file, it just modifies its cached copy and doesn't immediately send it back to Petal. There's an immediate problem here: suppose the user on workstation 2 tries to get a directory listing of /. We'd really like that user to see the newly created file; that's what users expect, and users will be very confused if the person down the hall from me created a file and said, oh, I put all this interesting information in this new file /a, why don't you go read it, and then I try to read it and it's totally not there. So we absolutely want very strong consistency: if the person down the hall says they've done something in the file system, I should be able to see it; and if I edit a file on one workstation and then compile it on another computer, I want the compiler to see the modifications I just made to my file. That means the file system has to do something to ensure that readers see even the most recent writes. We've been calling this strong consistency or linearizability before, and that's basically what we want; in the context of caches, though, the issue isn't really about the storage server, it's about the fact that there was a modification here that needs to be seen somewhere else, and for historical reasons that's usually called cache coherence. Cache coherence is the property of a caching system that even if I have an old version of something cached, if someone else modifies it in their cache, then my cache will automatically reflect their modifications. So we want this cache coherence property.

Another issue is that, since all the files and directories are shared, we could easily have a situation where two different workstations are modifying the same directory at the same time. Suppose user 1 on their workstation wants to create a file /a, a new file in the root directory, and at the same time user 2 wants to create a new file called /b. At some level they're creating different files, a and b, but they both need to modify the root directory to add a new name to it. So the question is, even if they do this simultaneously, two creations of differently named files but in the same directory from different workstations, will the system be able to sort out these concurrent modifications to the same directory and arrive at some sensible result?
Of course, the sensible result we want is that both a and b end up existing; we don't want to end up in some situation where only one of them exists because the second modification overwrote and superseded the first. This goes by a lot of different names, but we'll call it atomicity: we want operations such as creating a file or deleting a file to act as if they're instantaneous in time, and therefore never interfere with operations occurring at similar times on other workstations. We want things to happen at a point in time and not be spread out, so that even complex operations that touch a lot of state appear to occur instantaneously.

A final problem: suppose my workstation has modified a lot of stuff, and many of its modifications are done only in the local cache because of this write-back caching. If my workstation crashes after having modified some stuff in its local cache, having reflected some but not all of those modifications back to Petal, the other workstations are still executing, and they still need to be able to make sense of the file system. The fact that my workstation crashed while in the middle of something had better not wreck the entire file system for everybody else, or even any part of it. So what we need is crash recovery of individual servers: we want to be able to have my workstation crash without disturbing the activity of anybody else using the same shared system. Even if they look at my directories and my files, they should see something sensible; maybe it won't include the very last things I did, but they should see a consistent file system, not a wrecked file system data structure. As always with distributed systems, this is made more complex because we can easily have a situation where only one of the servers crashes while the others keep running.

For all three of these challenges, in this discussion they're really challenges about how Frangipani works, how the Frangipani software inside the workstations works. So when I talk about a crash, I'm talking about a crash of a workstation and its Frangipani. The Petal virtual disk has many similar questions associated with it, but those aren't really the focus today; Petal has a completely separate set of fault tolerance machinery built into it, actually a lot like the chain replication kind of systems we talked about earlier.

Okay, so I'm going to talk about each of these challenges in turn. The first challenge is cache coherence, and the game here is to get the benefits of both linearizability, meaning that when I look at anything in the file system I always see fresh data, the very latest data, and caching, as much caching as we can get, for performance.
Somehow we need to get the benefits of both of these. The way people implement cache coherence is with what are called cache coherence protocols, and it turns out these protocols are used in many different situations, not just distributed file systems: the per-core caches in multi-core processors also use cache coherence protocols, which are not unlike the protocols I'm going to describe for Frangipani.

All right. It turns out that Frangipani's cache coherence is driven by its use of locks, and we'll see locks come up later for both atomicity and crash recovery, but the particular use of locks I'm going to talk about for now is the use of locks to drive cache coherence, to help workstations ensure that even though they're caching data, they're caching the latest data. So as well as the Frangipani workstations and the Petal servers, there's a third kind of server in the Frangipani system: lock servers. I'm just going to pretend there's one lock server, although you could shard the locks over multiple servers. The lock server is, logically at least, a separate computer, although I think they ran them on the same hardware as the Petal servers. It basically just has a table of named locks. We'll consider the locks to be named after file names, although in fact they're named after inumbers. So for every file we potentially have a lock, and each lock is possibly owned by some owner. For this discussion I'm going to describe the locks as if they were exclusive locks, although in fact Frangipani has a more complicated scheme that allows either one writer or multiple readers. So, for example, maybe file x has recently been used by workstation 1, and workstation 1 has a lock on it; maybe file y was recently used by workstation 2, and workstation 2 has a lock on it. The lock server remembers, for each file, who has the lock, if anyone does.

Each workstation also keeps track of which locks it holds, and this is tightly tied to its tracking of cached data. In each workstation's Frangipani module there's also a lock table, recording which file the workstation holds a lock for, what kind of lock it is, and the cached contents of that file, which might be a whole bunch of data blocks, or directory contents, for example. So there's a lot of content here. When a Frangipani server decides it needs to use the directory /, or look at the file a, or look at an inode, it first asks the lock server for a lock on whatever it's about to use, and only then does it ask Petal for the data for that file or directory, whatever it needs to read.
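Here's a guess, again with invented names, at the shape of the two tables just described: the lock server's view, and a workstation's view with the cached data hanging off its lock table.

```go
// The lock server's table: for each lock (named by an inumber in real
// Frangipani, a plain uint64 here), which workstation holds it, if any.
type lockServer struct {
	owner map[uint64]string // lock ID -> holding workstation
}

// A workstation's lock table ties each held lock to the cached contents
// of the file or directory that lock protects. The busy/idle distinction
// is described next.
type lockState int

const (
	busy lockState = iota // some system call is actively using the data
	idle                  // still held here, but no operation is using it
)

type wsLockEntry struct {
	state  lockState
	dirty  bool   // cached contents modified, not yet written to Petal
	cached []byte // cached contents of the file or directory
}

type workstation struct {
	locks map[uint64]*wsLockEntry
}
```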
Then the workstation remembers: aha, I have a copy of file x, and its content, whatever the content of file x is, is cached. It turns out that workstations can hold a lock in at least two different modes. The workstation can be actively reading or writing the file or directory right now, that is, it's in the middle of a file creation operation or a deletion or a rename or something; in that case I'll say the lock is held by the workstation and is busy. Alternatively, after a workstation has done some operation like creating a file or reading a file, it releases the lock internally as soon as it's done with that system call, rename or read or write or create, whatever it was. It's not actively using that file anymore, but as far as the lock server is concerned, the workstation still holds the lock; the workstation just notes, for its own use, that it's not actively using that lock anymore. We'll say the lock is still held by the workstation but is idle, and that'll be important in a moment.

Okay, so I think these two tables are set up consistently: if this is workstation 1, the lock server knows locks for x and y exist and are both held by workstation 1; workstation 1 has the equivalent information in its table, knows it's holding these two locks, and furthermore remembers the cached content of the files or directories the two locks cover.

There are a number of rules that Frangipani follows that cause it to use the locks in a way that provides cache coherence, that ensures nobody ever uses stale data from their cache. These are rules used in conjunction with the locks and the cached data. The overriding invariant is that no workstation is allowed to hold any cached data unless it also holds the lock associated with that data: no cached data without the lock that protects that data. Operationally, what this means is that before a workstation uses data, it first acquires the lock on the data from the lock server, and only after the workstation has the lock does it read the data from Petal and put it in its cache. So the sequence is: acquire the lock, then read from Petal. You can't get the lock after the fact; if you want to cache the data, you first have to get the lock, and only strictly afterwards read from Petal. And if you ever release a lock, the rule is that before releasing it, if you modified the locked data in your cache, you first have to write the modified data back to Petal, and only when Petal says, yes, I got the data, are you allowed to release the lock, that is, give the lock back to the lock server. So the sequence is always: first write the cached data to the Petal storage system, and then release the lock, erasing the entry and the cached data from the workstation's lock table.
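Continuing the sketch, here are the two rules as code; the helper functions stand in for the RPCs described in the text, and the sketch pretends one lock covers one Petal block (real Frangipani maps an inumber to the relevant blocks).

```go
// Stubs standing in for the network operations in the text.
func requestLock(id uint64) error          { return nil } // request, wait for grant
func sendRelease(id uint64) error          { return nil } // release message
func petalRead(id uint64) ([]byte, error)  { return nil, nil }
func petalWrite(id uint64, d []byte) error { return nil }

// Rule 1: no cached data without the lock. Acquire the lock first; only
// then read the data from Petal into the cache.
func (ws *workstation) readLocked(id uint64) ([]byte, error) {
	if _, held := ws.locks[id]; !held {
		if err := requestLock(id); err != nil {
			return nil, err
		}
		data, err := petalRead(id) // strictly after the grant
		if err != nil {
			return nil, err
		}
		ws.locks[id] = &wsLockEntry{state: idle, cached: data}
	}
	return ws.locks[id].cached, nil
}

// Rule 2: before releasing a lock, write any modified data back to
// Petal; only after Petal acknowledges does the lock go back.
func (ws *workstation) release(id uint64) error {
	if e := ws.locks[id]; e != nil && e.dirty {
		if err := petalWrite(id, e.cached); err != nil {
			return err
		}
	}
	delete(ws.locks, id) // the cached copy goes away with the lock
	return sendRelease(id)
}
```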
What this results in is that the protocol between the workstations and the lock server consists of four different kinds of messages. This is the coherence protocol; you can think of them as essentially one-way network messages. There's a request message, from a workstation to the lock server, that says, hey, lock server, I'd like to get this lock. When the lock server is willing to give you the lock, and of course it can't immediately if somebody else holds it, but when the lock becomes free, the lock server responds with a grant message, from the lock server back to the workstation, in response to the earlier request.

Now, if you request a lock from the lock server and someone else holds the lock right now, that other workstation has to first give up the lock; we can't have two people owning the same lock. So how are we going to get that workstation to give up the lock? What I said before is that when a workstation is actually using the lock, actively reading or writing something, it holds the lock and has it marked busy; but workstations don't ordinarily give up their locks when they're done using them. If I create a file, then after the create system call finishes, my workstation will still own the lock for that new file; the lock will just be in state idle instead of busy, while as far as the lock server is concerned, my workstation still holds it. The reason for this, the reason to be lazy about handing locks back to the lock server, is that if I create a file called y on my workstation, I'm almost certainly about to use y for other purposes, maybe write some data to it or read from it. So it's extremely advantageous for the workstation to accumulate locks for all of its recently used files and not give them back unless it really has to. In the common case, where I use a bunch of files in my home directory and nobody on any other workstation ever looks at them, my workstation ends up accumulating dozens or hundreds of locks in idle state for my files.

But if somebody else does look at one of my files, they need to first get the lock, and I have to give it up. The way that works is that if the lock server receives a lock request and sees in its table, aha, that lock is currently owned by workstation 1, the lock server will send a revoke message to the workstation that currently owns the lock, saying, look, somebody else wants this lock, please give it up.
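The four message kinds, as the lecture names them (the paper's actual wire format may differ):

```go
// The coherence protocol's four one-way messages.
type msgKind int

const (
	msgRequest msgKind = iota // workstation -> lock server: I want this lock
	msgGrant                  // lock server -> workstation: you now hold it
	msgRevoke                 // lock server -> current holder: give it up
	msgRelease                // workstation -> lock server: I gave it up
)

type lockMsg struct {
	kind msgKind
	id   uint64 // lock identifier (an inumber in Frangipani)
	ws   string // the workstation involved
}
```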
When a workstation receives a revoke request, if the lock is idle, then if the cached data is dirty, the workstation will first write that dirty, modified data from its cache back to Petal, because the rule that you never cache data without a lock says you have to write the modified data back to Petal before releasing. So if the lock is idle, the workstation first writes back the data, if it's been modified, and then sends a message back to the lock server saying, okay, we give up this lock. So the response to a revoke sent to a workstation is that the workstation sends a release. Of course, if the workstation gets a revoke while it's actively using a lock, while it's in the middle of a delete or a rename or something that affects the locked file, the workstation will not give up the lock until it's done using it, until it's finished that file system operation, whatever system call it was that was using the file. Then the lock in the workstation's lock table will transition to idle, and at that point the workstation can pay attention to the revoke request and, after writing to Petal if need be, release the lock.

All right, so this is the coherence protocol, or rather a simplification of the coherence protocol that Frangipani uses. As I mentioned before, what's missing from all this is the fact that locks can be either exclusive, for writers, or shared, for read-only access. And just as Petal is a block server that doesn't understand anything about file systems, the lock server also doesn't know anything about files or directories or the file system; these names are really opaque lock identifiers. It just has this table of opaque IDs and who owns those locks, and it's Frangipani that knows, ah, the lock I associate with a given file has such-and-such an identifier. As it happens, Frangipani uses UNIX-style inumbers, the numbers associated with files, rather than names, for its locks.
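Putting the revoke rules together, still ignoring the write-ahead log (whose role in this sequence comes up later), a hypothetical handler on the workstation side might look like:

```go
// handleRevoke sketches a workstation's response to a revoke message,
// before logging enters the picture. A busy lock defers the release
// until the ongoing system call finishes and the lock goes idle.
func (ws *workstation) handleRevoke(id uint64) error {
	e, held := ws.locks[id]
	if !held {
		return nil // we don't hold it; nothing to give up
	}
	waitUntilIdle(e) // block until any ongoing operation completes
	// release writes dirty data back to Petal before the lock goes back,
	// preserving the no-cached-data-without-the-lock invariant.
	return ws.release(id)
}

func waitUntilIdle(e *wsLockEntry) { /* hypothetical synchronization */ }
```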
Just to make this coherence protocol concrete, and to illustrate again the relationship between Petal operations and lock server operations, let me run through what happens if one workstation modifies some file system data and then another workstation needs to look at it. We have two workstations and the lock server. The way the protocol plays out: suppose workstation 1 wants to read and then modify file z. Before it can even read anything about z from Petal, it must first acquire the lock for z, so it sends an acquire request to the lock server. Maybe nobody holds the lock, maybe the lock server has never heard anything about it, so the lock server makes a new entry for z in its table and returns a reply saying, yes, you have the grant for lock z. At this point the workstation has the lock on file z and is entitled to read information about it from Petal, so now workstation 1 reads z from Petal, and indeed it can modify z locally in its cache.

At some later point, maybe the human being sitting in front of workstation 2 also wants to read file z. Workstation 2 doesn't have the lock for z, so the very first thing it needs to do is send a message to the lock server saying, I'd like to get the lock for file z. The lock server knows it can't reply yes yet, because somebody else, namely workstation 1, has the lock; so the lock server sends a revoke to workstation 1. Workstation 1 is not allowed to give up the lock until it writes any modified data back to Petal, so it now writes any modified content of the file back to Petal. Only then is workstation 1 allowed to send a release back to the lock server. The lock server must have kept a record in some table saying, there's somebody waiting for lock z; as soon as its current holder releases it, we need to reply. So receipt of this release causes the lock server to update its tables and finally send the grant to workstation 2, and at this point workstation 2 can finally read z from Petal. This is how the cache coherence protocol plays out to ensure that nobody reads the data until anybody who might have had it modified privately in their cache first writes the data back to Petal. The locking machinery forces reads to see the latest write.

There are a number of optimizations possible in these kinds of cache coherence protocols. I've actually already described one: the idle state, the fact that workstations hold onto locks they're not using right now instead of immediately releasing them, is already an optimization over the simplest protocol you could think of. The other main optimization Frangipani has is the notion of shared read locks versus exclusive write locks. If lots and lots of workstations need to read the same file but nobody's writing it, they can all hold a read lock on that file; and if somebody does come along and try to write this widely cached file, the lock server first needs to revoke everybody's read lock, so that everybody gives up their cached copy, and only then is the writer allowed to write the file. That's okay, because nobody has a cached copy anymore, so nobody can read stale data while it's being written. All right, so that's the cache coherence story, driven by the locking protocol.

Next up on our list is... yes, that's a good question. In fact, there's a risk in the scheme I described: if I modify a file on my workstation and nobody else ever reads it, the only copy of the modified file, which may have some precious information in it, is in the cache, in RAM, on my workstation. If my workstation were to crash, and we hadn't done anything special, it would have crashed with the only copy of the data, and the data would be lost. In order to forestall this, no matter what, all these workstations write back anything modified in their cache every 30 seconds, so that if my workstation crashes unexpectedly, I may lose the last 30 seconds of work, but no more. This actually just mimics the way ordinary Linux or UNIX works.
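That periodic flush might look like this sketch; the 30-second figure is from the lecture, everything else here is invented.

```go
import "time"

// periodicFlush bounds how much work a crash can lose: every 30 seconds,
// write every dirty cached block back to Petal. (In real Frangipani the
// relevant log entries must reach Petal before the blocks they describe;
// that ordering is covered below.)
func (ws *workstation) periodicFlush(disk VirtualDisk) {
	for {
		time.Sleep(30 * time.Second)
		for id, e := range ws.locks {
			if e.dirty {
				if err := disk.Write(id, e.cached); err == nil {
					e.dirty = false
				}
			}
		}
	}
}
```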
Indeed, a lot of this story is about, in the context of a distributed file system, trying to mimic the properties that ordinary UNIX-style workstations have, so that users won't be surprised by Frangipani: it just works much the same way they're already used to.

All right, so our next challenge is atomicity: how to make it so that even when I do a complex operation like creating a file, which after all involves marking a new inode as allocated, initializing the inode (the inode is the little piece of data that describes each file), maybe allocating space for the file, and adding a new name in the directory for my new file, nobody sees any of the intermediate steps. There are many things that have to be updated, and we want other workstations to see the file either not exist or completely exist, but not something in between: we want atomic multi-step operations.

All right, so in order to implement this, in order to make multi-step operations like file create or rename or delete atomic as far as other workstations are concerned, Frangipani implements a notion of transactions, a complete database-style transaction system inside it, again driven by the locks. Furthermore, it's actually a distributed transaction system, and we'll hear more about distributed transaction systems later in the course; they're a very common requirement in distributed systems. The basic story is that Frangipani makes it so that other workstations can't see my modifications until an operation is completely done, by first acquiring all the locks on all the data that I'm going to need to read or write during my operation, and not releasing any of those locks until it's finished with the complete operation, and of course, following the coherence rule, written all of the modified data back to Petal. So before I do an operation like renaming, moving a file from one directory to another, which after all modifies both directories, and during which I don't want anybody to see the file in neither directory, or in both, Frangipani first acquires all the locks for the operation, then does everything, all the updates, then writes to Petal, and then releases. And this is easy: since we already had the lock server anyway in order to drive the cache coherence protocol, just by making sure we hold all the locks for the entire duration of an operation, we get these indivisible atomic transactions almost for free. That's basically all there is to say about making operations atomic: Frangipani just holds all the locks.
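For instance, a rename between two directories might look like this sketch (the directory-editing helpers are hypothetical). Note that at the end the locks merely go idle; they aren't handed back.

```go
// rename sketches an atomic multi-step operation: acquire every lock the
// operation touches, do all the updates in the cache, and only then let
// the locks go idle. No other workstation can observe the state where
// the name is in neither directory (or both).
func (ws *workstation) rename(srcDir, dstDir uint64, name string) error {
	// 1. Acquire all the locks before reading or writing anything.
	for _, id := range []uint64{srcDir, dstDir} {
		if _, err := ws.readLocked(id); err != nil {
			return err
		}
		ws.locks[id].state = busy
	}
	// 2. Do every update, purely in the local cache.
	removeName(ws.locks[srcDir], name)
	addName(ws.locks[dstDir], name)
	// 3. Only now do the locks go idle; if the lock server revokes them,
	// the modified blocks are written back to Petal before release.
	ws.locks[srcDir].state = idle
	ws.locks[dstDir].state = idle
	return nil
}

// Hypothetical helpers that edit a cached directory block and mark it dirty.
func removeName(e *wsLockEntry, name string) { e.dirty = true }
func addName(e *wsLockEntry, name string)    { e.dirty = true }
```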
An interesting thing about this use of locks is that Frangipani is using locks for two almost opposite purposes. For cache coherence, Frangipani uses the locks to make sure that writes are visible immediately to anybody who wants to read them; that use is essentially about making sure people can see writes. This use of the locks is all about making sure people don't see the writes until I'm finished with an operation, because I hold all the locks until all the writes have been done. So they're playing an interesting trick here, reusing the locks they would have had to have anyway for transactions in order to drive cache coherence.

All right, so the next interesting thing is crash recovery. We need to cope with several possibilities, and the most interesting is that a workstation crashes while holding locks, while in the middle of some complex set of updates. That is, the workstation acquired a bunch of locks, it's writing a whole lot of data, maybe to create or delete files, and it has possibly written some of those modifications back to Petal, maybe because it was soon going to release locks, or had been asked by the lock server to release locks; so it has done some of the writes back to Petal for its complex operation, but not all of them, and then it crashes before giving up the locks. That's the interesting situation for crash recovery.

There are a number of approaches that don't work very well for workstation crashes. One thing that doesn't work is to just observe that the workstation crashed and release all its locks. If it had done something like create a new file, and it had written the file's directory entry, its name, back to Petal, but it hadn't yet written the initialized inode that describes the file, then the inode in Petal may still be filled with garbage, or some previous file's information, and yet we've already written the directory entry. So it's not okay to just release a crashed workstation's locks. Another thing that's not okay is to never release the crashed workstation's locks. That would be correct in a sense: if it crashed while in the middle of writing out some of its modifications, the fact that it hadn't written out all of them means it can't release its locks, and simply not releasing them would hide the partial update from any readers, so nobody would ever be confused by seeing partially updated data structures in Petal. On the other hand, anybody who needed to use those files would then have to wait forever for the locks. We absolutely have to give up the locks so that other workstations can keep using the same files and directories, but we have to do something about the fact that the workstation might have done some of the writes for its operations but not all.

So Frangipani, like almost every other system that needs to implement crash-recoverable transactions, uses write-ahead logging. This is something we've seen at least one instance of: last lecture, Aurora was also using write-ahead logging.
The idea is that if a workstation needs to do a complex operation that involves updating many pieces of data in Petal, in the file system, the workstation will first, before it makes any writes to Petal, append a log entry to its log in Petal describing the full set of writes it's about to do, and only when that log entry describing the full set of updates is safely in Petal, where anybody else can see it, will the workstation start to send the writes for the operation out to Petal. So before a workstation could ever reveal even one of its writes for an operation to Petal, the log entry describing the whole operation, all of the updates, must already exist in Petal. This is very standard; this is just a description of write-ahead logging.

But there are a couple of odd aspects to how Frangipani implements write-ahead logging. The first is that in most transaction systems there's just one log, and all the transactions in the system sit there in one log, in one place. So if there's a crash, and there's more than one operation that affects the same piece of data, we have all of those operations for that piece of data, and everything else, right there in the single log sequence, and we know, for example, which is the most recent update to a given piece of data. Frangipani doesn't do that: it has one log per workstation, separate logs. The other very interesting thing about Frangipani's logging system is that the workstation logs are stored in Petal, not on local disk. In almost every system that uses logging, the log is tightly associated with whatever computer is running the transactions, and it's almost always kept on a local disk. But for extremely good reasons, Frangipani workstations store their logs in Petal, in the shared storage: each workstation has its own sort of semi-private log, but it's stored in Petal, where, if the workstation crashes, its log can be gotten at by other workstations. So the logs are in Petal: separate logs per workstation, stored somewhere else, in public, shared storage. A very interesting and unusual arrangement.

All right, so we need to know roughly what's in a log entry. Unfortunately the paper is not super explicit about the format of a log entry, but the paper does say that each workstation's log sits in a known place, a known range of block numbers, in Petal, and furthermore that each workstation uses its log space in Petal in a circular way: it writes log entries along from the beginning, and when it hits the end, the workstation goes back and reuses its log space from the beginning of its log area. Of course, that means workstations need to be able to clean their logs, to ensure that a log entry isn't needed anymore before its space is reused; I'll talk about that in a bit.
Each log consists of a sequence of log entries. Each log entry has a log sequence number, just an increasing number; each workstation numbers its log entries 1, 2, 3, 4, 5. The immediate reason for this, maybe the only reason the paper mentions, is that the way Frangipani detects the end of a workstation's log, if the workstation crashes, is by scanning forward in its log in Petal until it sees the sequence numbers stop increasing; it knows then that the log entry with the highest log sequence number must be the very last entry. It needs to be able to detect the end of the log. So we have this log sequence number, and then I believe each log entry has an array of descriptions of modifications: all the different modifications that were involved in a particular operation, one file system system call. Each entry in the array has a block number, which is a block number in Petal; a version number, which we'll get to in a bit; and then the data to be written. There's a bunch of these, because they're required to describe operations that might touch more than one piece of data in the file system.

One thing to notice is that the log only contains information about changes to metadata, that is, to directories and inodes and allocation bitmaps in the file system. The log doesn't contain the data that's written to the contents of files, the user's data; it contains just enough information to make the file system's structures recoverable after a crash. So, for example, if I create a file called f in a directory, that's going to result in a new log entry that has two little descriptions of modifications in it: one describing how to initialize the new file's inode, and another describing the new name to be placed in the new file's directory.
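A guess at the shape of a log entry, following the lecture's description (the paper isn't fully explicit about the format):

```go
// One logged write: which Petal block, the new version number for that
// piece of metadata, and the bytes to put there. Only metadata is
// logged, never file contents.
type update struct {
	block   uint64
	version uint64
	data    []byte
}

// A LogEntry groups all the writes of one operation, so recovery can do
// all of them or none. A real entry would also need a checksum, so that
// a partially written entry is never replayed.
type LogEntry struct {
	lsn     uint64 // log sequence number: 1, 2, 3, ... per workstation
	updates []update
}
```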
updated blocks that are covered by 55:16 the lock that's being revoked so write 55:21 modified blocks 55:28 just for that provoked to lock and then 55:40 send a release message and the reason 55:48 for this sequencing and for this strict 55:50 ban is that these modifications if we 55:54 write them to peddle you know their 55:55 modifications to the data structure the 55:58 file system data structure and if we 56:00 were to crash midway through baby news 56:01 box just as usual we want to make sure 56:04 that some other workstation somebody 56:07 else there's enough information to be 56:09 able to complete the set of 56:12 modifications that the were station is 56:14 made even though the workstation has 56:16 crashed and maybe didn't finish doing 56:17 these rights and writing the log first 56:20 it's gonna be what allows us to 56:22 accomplish it these these log records 56:24 are a complete description of what these 56:26 modifications are going to be so first 56:28 we you know first we write though the 56:30 complete log to petal and then we 56:33 workstation can start writing its 56:35 modified blocks you know maybe it 56:37 crashes maybe doesn't hopefully not and 56:39 if it finishes writing as modified 56:41 blocks then it could send the release 56:43 back to the lock server so you know if 56:45 my workstation has modified a bunch of 56:46 files and then some other workstation 56:48 wants to read one of those files this is 56:50 the sequence that happens lock so ever 56:52 asked me for my locks right back my 56:54 workstation right back said log then 56:56 right back 56:58 writes the dirty modified blocks to 57:01 peddle and only then releases and then 57:03 the other workstation can acquire the 57:04 lock and read these blocks so that's 57:06 sort of the non crash you know if a 57:09 crash doesn't happen that is the 57:13 sequence of course it's only interesting 57:17 if a crash happens yes 57:21 [Music] 57:35 okay so for the log you're absolutely 57:38 right it writes the entire log and yeah 57:42 so so if if we get a revoke for a 57:44 particular file the workstation will 57:47 write its entire log and then only it's 57:53 only because it's only giving up the 57:54 lock for Z it it only needs to write 57:59 back data that's covered by Z so I have 58:01 to write the whole log just the data 58:03 that's covered by the lock that we 58:05 needed to give up and then we can 58:07 release that lock so yeah you know maybe 58:10 this writing the whole log might be 58:11 overkill like you if it turned out you 58:13 know so here's an optimization that you 58:15 might or might not care about if the 58:18 last modification for profile Z for the 58:21 lock were giving up is this one but 58:22 subsequent entries in my log didn't 58:25 modify that file then I could just write 58:27 just this prefix of my in-memory log 58:30 back to petal and you know be lazy about 58:33 writing the rest and that might see me 58:36 sometime 58:37 I might have to write the log back it's 58:41 actually not clear I would save us a lot 58:42 of time we have to write the log back at 58:43 some point anyway and yeah I think petal 58:47 just writes the whole thing okay okay so 58:53 now we can talk about what happens when 58:56 a workstation crashes while holding 58:58 locks right it's you know needs to 59:01 modify something rename a file create a 59:03 file whatever it's acquired all the 59:05 locks it needs it's modified some stuff 59:07 in its own cache to reflect these 59:13 operations maybe written some stuff back 
Okay, so now we can talk about what happens when a workstation crashes while holding locks. It needed to modify something, rename a file, create a file, whatever; it acquired all the locks it needed; it modified some stuff in its own cache to reflect those operations; maybe it wrote some stuff back to Petal; and then it crashed, possibly midway through writing. So there are a number of points at which it could crash. But because the sequence is always log first, always before writing modified blocks from the cache back to Petal, Frangipani will always have written its log to Petal first. That means that if a crash happens, it's either while the workstation is writing its log back to Petal, but before it has written any modified file or directory blocks; or while it's writing those modified blocks back, but therefore definitely after it has written its entire log; or maybe the crash happened after it completely finished all of this. So, because of the sequencing, there's only a limited number of scenarios we need to worry about for the crash.

Okay, so the workstation has crashed, and, to be exciting, let's say it crashed while holding locks. The first thing that happens is that the lock server sends it a revoke request and gets no response. That's what triggers anything: if nobody ever asks for the lock, basically nobody's ever going to notice that the workstation crashed. So let's assume somebody else wanted one of the locks the workstation held when it crashed. The lock server sends a revoke, and it will never get a release back from the workstation. After a certain amount of time has passed, and it turns out Frangipani locks use leases, for a number of reasons, so after the lease time has expired, the lock server will decide that the workstation must have crashed, and it will initiate recovery. What that really means is telling a different workstation: the lock server will tell some other live workstation, look, workstation 1 seems to have crashed; please go read its log and replay all of its recent operations to make sure they're complete, and tell me when you're done. Only then is the lock server going to release the locks. And this is the point at which it was critical that the logs are in Petal, because some other workstation is going to inspect the crashed workstation's log in Petal.

All right, so what are the possibilities? One is that the workstation crashed before it ever wrote anything back. That means the other workstation doing recovery will look at the crashed workstation's log, see that there's maybe nothing in it at all, do nothing, and then release the locks the workstation held. Now, the crashed workstation may have modified all kinds of things in its cache, but if it didn't write anything to its log area, then it couldn't possibly have written any of the blocks it modified during those operations. So while we will have lost the last few operations the workstation did, the file system is going to be consistent as of a point in time before the crashed workstation started to modify anything, because apparently the workstation never even got to the point where it was writing log entries.
The next possibility is that the workstation wrote some log entries to its log area. In that case, the recovering workstation will scan forward from the beginning of the log until it stops seeing the log sequence numbers increasing; that's the point where the log must end. The recovering workstation will look at each of these descriptions of a change and basically play that change back into Petal: it'll say, ah, this certain block number in Petal needs to have this certain data written to it, which is just the same modification the crashed workstation did in its own local cache. So the recovering workstation will consider each of these and replay each of the crashed workstation's log entries back into Petal, and when it's done that, all the way to the end of the crashed workstation's log as it exists in Petal, it'll tell the lock server, and the lock server will release the crashed workstation's locks. That brings Petal up to date with some prefix of the operations the crashed workstation had done before crashing, maybe not all of them, because maybe it didn't write out all of its log; but the recovering workstation won't replay anything in a log entry unless it has the complete log entry in Petal. Implicitly, that means there's got to be some sort of checksum arrangement, so the recovering workstation will know, aha, this log entry is complete, not partially written. That's quite important, because the whole point of this is to make sure that only complete operations are visible in Petal, never, never, never a partial operation. It's also important that all the writes for a given operation are grouped together in the log, so that on recovery the recovering workstation can do all of the writes for an operation, or none of them, never half of them.

Okay, so that's what happens if the crash occurs while the log is being written back to Petal. Another interesting possibility is that the workstation crashed after writing its log and also after writing some of the blocks back itself, and then crashed. Then, skimming over some extremely important details, which I'll get to in a moment, what will happen is again that the recovering workstation, which of course doesn't really know the point at which the workstation crashed, all it sees is, oh, here are some log entries, will replay the log in the same way. More or less what's going on is that even if the modifications were already done in Petal, the recovering workstation replays the same modifications: it just writes the same data to the same place again, presumably not really changing the value for the writes that had already been completed. But if the crashed workstation hadn't done some of its writes, then some of these replayed writes will actually change the data and complete the operations.
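The replay pass might look like this sketch, ignoring for the moment the version-number check described next; entryComplete stands in for the checksum test.

```go
// replayLog is run by a live workstation over a crashed workstation's
// log as read from Petal. It stops where the sequence numbers stop
// increasing (the end of the log) and replays each complete entry.
func replayLog(entries []LogEntry, disk VirtualDisk) error {
	var last uint64
	for _, e := range entries {
		if e.lsn <= last {
			break // sequence numbers stopped increasing: end of log
		}
		last = e.lsn
		if !entryComplete(e) {
			break // never replay a partially written entry
		}
		for _, u := range e.updates {
			// Rewrite the same data the crashed workstation held in its
			// cache; harmless if the block was already written back.
			if err := disk.Write(u.block, u.data); err != nil {
				return err
			}
		}
	}
	return nil
}

func entryComplete(e LogEntry) bool { return true } // checksum test, hypothetical
```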
And today's question sets up 66:12 a particular scenario for which a little 66:14 bit of added complexity is necessary, in 66:22 particular the possibility that the 66:24 crashed workstation had actually gotten 66:27 through this entire sequence before 66:29 crashing, and in fact released some of 66:31 its locks, so that it wasn't the last 66:37 workstation to modify a 66:40 particular piece of data. An example 66:42 of this: what happens if we have some 66:44 workstation and it executes, say, a delete -- 66:50 it deletes a file, say a file F in 66:57 directory D -- and then there's some other 67:03 workstation which, after this delete, 67:07 creates a new file with the same name, 67:09 but of course it's a different file now. 67:12 So 67:15 workstation two later creates a 67:22 file with the same name, and then after that 67:27 workstation one crashes. So we're going to 67:32 need somebody to do recovery on workstation 67:34 one's log, and at this point in time 67:38 maybe there's a third 67:39 workstation doing the recovery. 67:45 So now workstation three is doing recovery 67:52 on workstation one's log. So the sequence 67:56 is: workstation one deleted a file, workstation 67:58 two created a file, workstation three 68:00 does recovery. Well, it could be 68:04 that this delete is still in workstation 68:06 one's log. So when 68:09 workstation one crashes, workstation 68:11 three is going to look at its log 68:13 and replay 68:14 all the updates in workstation one's log. 68:19 The updates for this 68:21 delete -- the entry for this delete -- may 68:23 still be in workstation one's log, so 68:25 unless we do something clever, 68:27 workstation three is going to delete this 68:29 file, because the delete 68:32 operation erased the relevant entry from 68:34 the directory, 68:35 thus actually deleting the file. 68:40 But it's a different file, one that 68:41 workstation two created afterwards, so 68:44 that's completely wrong. What we 68:46 want, the outcome we want, is: 68:48 workstation one deleted a 68:50 file, so that file should be deleted, but a 68:52 new file with the same name should not be 68:55 deleted just because of a crash and a 68:56 restart, because the create happened after the 68:58 delete. All right, so we cannot just 69:01 replay workstation one's log without 69:05 further thought, because 69:09 a log entry in workstation 69:11 one's log may be out of date by the time 69:13 it's replayed during recovery; some 69:16 other workstation may have modified the 69:17 same data in some other way 69:19 subsequently. So we can't blindly replay 69:22 the log entries. And this is 69:26 today's question, and the way 69:29 Frangipani solves this is by associating 69:32 version numbers with every piece of data 69:36 in the file system as stored in Petal, 69:39 and also associating the same version 69:42 number with every update that's 69:45 described in the log. So 69:49 first, 69:53 in Petal: 69:59 every piece of metadata -- every 70:02 inode, every piece of data that's 70:06 like the contents of a directory, for 70:08 example -- every block of metadata 70:12 stored in Petal has a version number. 70:14 When a workstation needs to modify a 70:19 piece of metadata in Petal, it first 70:21 reads that metadata from Petal into its 70:23 memory and then looks at the existing 70:27 version number, 70:28 and then, when it's creating the log entry 70:30 describing its modification, it puts the 70:32 existing version number plus one into 70:36 the log entry; and then, if it 70:41 does get a chance to write the data back, 70:43 it'll write the data back with the new, 70:45 increased version number.
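A small Go sketch of that stamping step, with invented names, might look like the following: with the lock held, the workstation reads the current block and records version-plus-one both in the log entry and, later, in the written-back block.

// A minimal sketch (invented names) of stamping a logged update with
// the metadata block's version number plus one.
package main

type metaBlock struct {
	Version uint64 // every metadata block in Petal carries a version
	Data    []byte
}

type update struct {
	BlockNum   uint64
	NewVersion uint64 // old version in Petal plus one
	NewData    []byte
}

// prepareUpdate is what a workstation does, with the lock held, when it
// modifies (say) a directory block: read the current block, and record
// version+1 in the log entry describing the change.
func prepareUpdate(read func(bn uint64) metaBlock, bn uint64, newData []byte) update {
	cur := read(bn)
	return update{BlockNum: bn, NewVersion: cur.Version + 1, NewData: newData}
}

func main() {}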
So if a workstation 70:48 hasn't crashed, or if 70:51 it did manage to write some data back 70:52 before it crashed, then the version 70:55 number stored in Petal for the 70:56 affected metadata will be at least as 70:59 high as the version number 71:02 stored in the log entry; it will be 71:04 higher if some other workstation 71:05 subsequently modified it. So what will 71:09 actually happen here is that what 71:13 workstation three will see is that the log 71:17 entry for workstation one's delete 71:20 operation will have a particular version 71:23 number stored in the log entry, 71:26 associated with the modification to the 71:28 directory, let's say. The log entry 71:31 will say, well, the 71:33 new version number for the directory 71:35 created by this log entry is version 71:37 number three. In order for workstation 71:40 two to subsequently change the directory -- 71:42 that is, to add a file -- in fact, before 71:45 it crashed, workstation one must have 71:47 given up the lock on the directory, and 71:49 that's probably why the log entry even 71:52 exists in Petal. So workstation one must 71:55 have given up the lock; apparently 71:56 workstation two got the lock, read 71:58 the current metadata for the directory, 72:02 and saw that the version number was now 72:04 three, and when workstation two writes this 72:08 data it will set the version number of 72:14 the directory in Petal to be four. OK, so 72:19 that means the log entry for this 72:22 delete operation is going to have 72:24 version number three in it. Now when the 72:28 recovery software on workstation three 72:31 replays workstation one's log, it looks at 72:34 the version numbers first. So it'll look 72:36 at the version number in the log entry, 72:37 it'll read the block from Petal and 72:40 look at the version number in the block, 72:42 and if the version number in the block 72:44 in Petal is greater than or equal to the 72:47 version number in the log entry, the 72:50 recovery software will simply ignore 72:51 that update in the log entry and not do 72:54 it, because clearly the block had already 72:57 been written back by the crashed 72:59 workstation and then maybe subsequently 73:01 modified by other workstations. So the 73:05 replay is actually selective, based on 73:06 this version number: 73:08 recovery only 73:14 replays a write from the log if that 73:17 write -- the write in the log 73:20 entry -- is newer than the data that's 73:22 already stored in Petal.
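Here's a minimal Go sketch of that version-checked replay. The types are invented, but the comparison is the one just described: replay a logged write only if its version is strictly newer than what Petal already holds.

// A minimal sketch (invented types) of version-checked log replay.
package main

type block struct {
	Version uint64
	Data    []byte
}

// store stands in for Petal.
type store interface {
	Read(bn uint64) block
	Write(bn uint64, b block)
}

type loggedWrite struct {
	BlockNum uint64
	Version  uint64 // old version + 1, recorded when the entry was logged
	Data     []byte
}

func replay(s store, w loggedWrite) {
	cur := s.Read(w.BlockNum)
	if cur.Version >= w.Version {
		// Petal already holds this write, or something newer, e.g.
		// workstation two's re-created file. Replaying would clobber
		// newer data, so skip it.
		return
	}
	s.Write(w.BlockNum, block{Version: w.Version, Data: w.Data})
}

func main() {}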
So one sort of 73:31 irritating question here, maybe, is that 73:34 workstation three is running this 73:37 recovery software while other 73:39 workstations are still actively reading and 73:41 writing the file system 73:42 and hold locks and are doing I/O to Petal. 73:46 So the replay is going to go on while 73:50 workstation two -- which 73:52 doesn't know anything about the recovery -- is still active, 73:54 and indeed workstation two may hold the 73:57 lock for this directory 74:00 while recovery is going on. So recovery 74:03 may be scanning the log and need 74:05 to read or write this directory's data 74:08 in Petal while workstation two still has 74:11 the lock on this data. The question is: 74:14 how do we sort this out? 74:16 One possibility, which actually 74:19 turns out not to work, is for the 74:22 recovery software to first acquire the 74:24 lock on anything that it needs to look 74:28 at in Petal while it's replaying 74:30 the log. 74:36 One good 74:36 reason why that doesn't work is that it 74:38 could be that we're running recovery 74:39 after a system-wide power failure, for 74:41 example, in which all knowledge of who 74:43 held which locks is lost, and therefore we 74:46 cannot write the recovery software to 74:49 participate in the locking 74:52 protocol, because 74:54 all knowledge of what's locked 74:56 and what's not locked may have been lost in 74:57 the power failure. 74:58 But luckily, it turns out that the 75:01 recovery software can just go ahead and 75:03 read or write blocks in Petal -- 75:07 sorry, read or write data 75:09 in Petal -- without worrying at all about 75:11 locks. The reason is that if the 75:15 recovery software 75:16 wants to replay this log entry 75:18 and possibly modify the data associated 75:20 with this directory, it just goes ahead 75:22 and reads whatever's there for the 75:23 directory out of Petal right now, and 75:26 there are really only two cases: either the 75:28 crashed workstation one had given up its 75:30 lock or it hadn't. If it hadn't given up 75:33 the lock, then nobody else can have the 75:35 directory locked, and so there's no 75:36 problem. If it had given up its lock, then 75:39 before it gave up the lock it must 75:42 have written its data for the 75:46 directory back to Petal, and that means 75:50 that the version number stored in Petal 75:52 must be at least as high as the version 75:53 number in the crashed workstation's log 75:56 entry, and therefore when the recovery 75:58 software compares the log entry's version 76:00 number with the version number of the 76:02 data in Petal, it'll see that the log 76:04 entry's version number is not higher and 76:07 therefore won't replay the log entry. So, 76:11 yes, the recovery software will have 76:13 read the block without holding the lock, 76:15 but it's not going to modify it, because 76:18 if the lock was released, the version 76:19 number will be high enough to show that 76:21 the log entry had already been 76:26 applied to Petal before the crashed 76:28 workstation crashed. So there's no 76:31 locking issue. All right, 76:41 I've now gone over the main guts of what 76:43 Frangipani is up to: its cache coherence, 76:46 its distributed transactions, and its 76:49 distributed crash recovery. The other 76:53 thing to think about: the paper 76:55 talks a bit about performance. It's 76:57 actually very hard, after over 20 years, 77:00 to interpret performance numbers, because 77:02 they ran their performance numbers on 77:04 very different hardware in a very 77:06 different environment from what 77:08 you see today. Roughly speaking, the 77:11 performance numbers they show are that as 77:12 you add more and more Frangipani 77:15 workstations, the system basically 77:19 doesn't get slower; that is, each new 77:22 workstation, even if it's actively doing 77:24 file system operations, doesn't slow down 77:26 the existing workstations. So in that 77:28 sense -- at least for 77:30 the applications they looked at -- the system 77:32 was giving them reasonable scalability: 77:34 they could add more workstations without 77:36 slowing existing users down. Looking 77:42 backwards, 77:44 although Frangipani is full of very 77:47 interesting
techniques that are worth 77:49 remembering, it didn't have much 77:51 influence on the evolution of 77:55 storage systems. Part of the reason is 77:58 that the environment it's aimed 78:00 at -- small workgroups, 78:02 people sitting in front of workstations 78:04 on their desks and sharing files -- 78:06 while it still exists in some 78:09 places, isn't really where the action is 78:12 in distributed storage. The 78:13 real action has moved into 78:15 big data centers, big websites, big data 78:19 computations. In that 78:22 world, first of all, the file system 78:24 interface just isn't very useful 78:25 compared to databases: people really 78:28 do want transactions in the big-website 78:31 world, but they need them for very small 78:33 items of data, the kind of data that you 78:35 would store in a database rather than 78:39 the kind of data that you would 78:40 naturally store in a file system. So 78:44 you can see echoes of 78:47 some of this technology in modern 78:49 systems, but it usually takes the form of 78:50 some database. The other big kind of 78:53 storage that's out there is storage of big 78:56 files as needed for big data 78:59 computations like MapReduce, and indeed 79:01 GFS, to some extent, looks 79:04 like a file system and is the kind of 79:06 storage system you want for MapReduce. 79:08 But for GFS and for big data 79:12 computations, Frangipani's 79:15 focus on local caching in workstations 79:19 and very close attention to 79:22 cache coherence and locking is just 79:24 not very useful: for both the 79:27 data reads and writes, 79:29 caching is typically not useful at all. 79:33 If you're reading through ten 79:35 terabytes of data, it's almost 79:38 counterproductive to cache it. So 79:41 a lot of the focus of Frangipani -- 79:45 time has passed it by a little bit. It's 79:47 still useful in some situations, but it's 79:50 not what people are really thinking 79:52 about in designing new systems. All 79:56 right, that is it.