Transcript

00:00 I'd like to get started. Today we're going to talk about GFS, the Google File System paper we read for today, and this will be the first of a number of case studies we'll do in this course about how to build big storage systems. So the larger topic is big storage. The reason is that storage has turned out to be a key abstraction. You might imagine that there could be all kinds of different important abstractions you'd want to use for distributed systems, but it's turned out that a simple storage interface is just incredibly useful and extremely general. A lot of the thought that's gone into building distributed systems has either gone into designing storage systems, or into designing other systems that assume, underneath them, some sort of reasonably well-behaved big distributed storage system. So we're going to care a lot about how to design a good interface to a big storage system, and how to design the innards of the storage system so it has good behavior; that's why we're reading this paper, just to get a start on that. This paper also touches on a lot of themes that will come up a lot in 6.824: parallel performance, fault tolerance, replication, and consistency. And this paper is, as such things go, reasonably straightforward and easy to understand. It's also a good systems paper: it talks about issues all the way from the hardware up to the software that ultimately uses the system, and it's a successful real-world design. It's an academic paper published at an academic conference, but it describes something that really was successful and used for a long time in the real world, so we know we're talking about a good, useful design.

Okay. Before I talk about GFS I want to talk about the space of distributed storage systems a little bit, to set the scene. First: why is it hard? There's actually a lot to get right, but for 6.824 there's a particular narrative that's going to come up quite a lot for many systems. Often the starting point for people designing these big distributed systems or big storage systems is that they want huge aggregate performance: to harness the resources of hundreds of machines in order to get a huge amount of work done. So the starting point is often performance. If you start there, a natural next thought is: we're going to split our data over a huge number of servers in order to be able to read many servers in parallel. That's often called sharding. If you shard over many servers — hundreds or thousands of servers — you're just going to see constant faults: with thousands of servers, there's always going to be one down. So faults are everyday, every-hour occurrences, and we need automatic fault tolerance, without humans involved in fixing every fault. So performance leads to sharding, and sharding leads to fault tolerance.
03:44 Among the most powerful ways to get fault tolerance is replication: just keep two or three or however many copies of the data, and if one of them fails you can use another. So fault tolerance leads to replication. And if you have replication — two copies of the data — then you know for sure that, if you're not careful, they're going to get out of sync. What you thought were two replicas of the data, either of which you could use interchangeably to tolerate faults, ends up, if you're not careful, as two almost-identical copies that are not exactly replicas at all, and what you get back depends on which one you talk to. That's starting to look a little tricky for applications to use. So if we have replication, we risk weird inconsistencies. Of course, with clever design you can get rid of the inconsistency and make the data look very well-behaved, but doing that almost always requires extra work and extra chit-chat between all the different servers and clients in the network, and that reduces performance. So if you want consistency, you pay for it with lower performance — which is of course not what we were originally hoping for. This isn't absolute — you can build very high-performance systems — but nevertheless there's an inevitable way the design of these systems plays out, and it results in a tension between the original goal of performance and the realization that if you want good consistency you're going to pay for it, and if you don't want to pay for it, you have to suffer anomalous behavior sometimes. I'm putting this up because we're going to see this loop many times in the systems we look at; people are rarely willing to, or happy about, paying the full cost of very good consistency.

Okay, so good consistency. I'll talk more later in the course about exactly what I mean by it, but you can think of strong or good consistency as meaning we want to build a system whose behavior, to applications or clients, looks just like what you'd expect from talking to a single server. We're going to build systems out of hundreds of machines, but a kind of ideal strong-consistency model would be what you'd get if there were just one server, with one copy of the data, doing one thing at a time. So here's an intuitive way to think about strong consistency: you have one server, we'll assume it's single-threaded, and it processes requests from clients one at a time. That's important because there may be lots of clients sending requests concurrently; the server sees some concurrent requests, picks one or the other to go first, executes that request to completion, and then executes the next. For a storage server — the server's got a disk on it — what it means to process a write request, which might be setting an item or incrementing an item, is that the server applies that mutation to its data.
07:21 The server has some table of data, maybe indexed by keys and values; a mutation updates this table, and if the request that comes in is a read, the server just pulls the right data out of the table. One of the rules here that makes this well-behaved is that, in our simplified model, the server really does execute requests one at a time, and that requests see data reflecting all the previous operations, in order. So if a sequence of writes comes in and the server processes them in some order, then when you read, you see the value you'd expect if those writes had occurred one at a time.

The behavior of this is still not completely straightforward; there are some things you have to spend at least a second thinking about. For example, suppose we have a bunch of clients: client 1 issues a write of X, wanting to set it to 1, and at the same time client 2 issues a write of the same key but wants to set it to a different value, 2. Something happens. Then, after these writes complete, client 3 reads X and gets some result, and client 4 reads X and also gets a result. So what results should the two reading clients see?

Yeah — well, that's a good question. What I'm assuming here is that client 1 and client 2 launch these requests at the same time: if we were monitoring the network, we'd see two requests heading to the server at the same time, and sometime later the server would respond to them. There's actually not enough information here to tell which order the server processes them in. Of course, if it processes the write with value 2 second, that means subsequent reads have to see 2; whereas if the server happened to process that request first and client 1's request second, then the resulting value had better be 1, and both reads have to see it. I'm just putting this up to illustrate that even in a simple system there's ambiguity: you can't necessarily tell, from a trace of what went into the server, what should come out; all you can tell is whether some set of results is or is not consistent with a possible execution. Certainly there are some completely wrong results we could see go by: if client 3 sees a 2, then client 4 had better see a 2 also, because in our model, if client 3 sees a 2, that means the write of 2 must have been second, and it still has to have been second when client 4 does its read. Hopefully all of this is completely straightforward and just as expected, because it's supposed to be the intuitive model of strong consistency.

Okay, and the problem with this, of course, is that a single server has poor fault tolerance.
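To make that single-server model concrete, here's a minimal sketch in Go — my own illustration, not anything from the paper — of the idealized single-threaded key/value server: concurrent writes get some serial order, and every later read sees the result of that order.

```go
package main

import "fmt"

// kvServer is the idealized single server: one copy of the data, and (in this
// sketch) requests are executed strictly one at a time.
type kvServer struct {
	table map[string]string
}

func (s *kvServer) Put(key, value string) { s.table[key] = value }
func (s *kvServer) Get(key string) string { return s.table[key] }

func main() {
	s := &kvServer{table: map[string]string{}}
	// Clients 1 and 2 send writes concurrently; the server picks some order.
	s.Put("x", "1") // suppose client 1's write happens to be executed first
	s.Put("x", "2") // then client 2's
	// Clients 3 and 4 read after both writes complete: both must see "2",
	// the value a single server executing one request at a time would hold.
	fmt.Println(s.Get("x"), s.Get("x")) // prints: 2 2
}
```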
10:57 If it crashes, or its disk dies, or something, we're left with nothing. So in the real world of distributed systems we actually build replicated systems, and that's where all the problems start leaking in: when we have a second copy of the data. Here is what must be close to the worst replication design, and I'm showing it to warn you of the problems we'll be looking for in GFS. So here's a bad replication design: we have two servers, each with a complete copy of the data, and on disk they both have this table of keys and values. The intuition, of course, is that we want to keep these tables identical, so that if one server fails we can read or write from the other. That means that somehow every write must be processed by both servers, and reads have to be able to be processed by a single server — otherwise it's not fault-tolerant: if reads have to consult both, we can't survive the loss of one of the servers.

The problem comes up when, say, we have client 1 and client 2 and they both want to write: one of them is going to write 1 and the other is going to write 2. Client 1 launches its write of X=1 to both servers, because we want to update both of them, and client 2 launches its write of X=2 to both. So what's going to go wrong here? Yeah — we haven't done anything to ensure that the two servers process the two requests in the same order. That's the bad design. If server 1 processes client 1's request first, it'll start with a value of 1, then see client 2's request and overwrite that with 2. If server 2 just happens to receive the packets over the network in a different order, it will execute client 2's request and set the value to 2, and then it will see client 1's request and set the value to 1. Now, if client 3 happens to read from one server and client 4 happens to read from the other, we get into this terrible situation where they read different values, even though our intuitive model of a correct service says both subsequent reads have to see the same value.

And this can arise in other ways. Suppose we try to fix it by making the clients always read from server 1 if it's up, and otherwise from server 2. If we do that, then everybody reading might see value 2; but if server 1 suddenly fails, then even though there was no write, the value of X will switch from 2 to 1, because once server 1 dies all the clients switch to server 2. That's a mysterious change in the data that doesn't correspond to any write — also totally not something that could have happened in the simple single-server model. Of course this can be fixed; the fix requires more communication, usually between the servers, and more complexity — there's an inevitable cost in complexity and communication to get strong consistency.
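Just to make that bad replication design concrete — again a toy sketch of my own, not anything from GFS — two replicas that apply the same pair of writes in opposite orders simply end up permanently different:

```go
package main

import "fmt"

// apply plays a sequence of (key, value) writes against one replica's table,
// in the order that replica happened to receive them.
func apply(replica map[string]string, writes [][2]string) {
	for _, w := range writes {
		replica[w[0]] = w[1]
	}
}

func main() {
	r1 := map[string]string{}
	r2 := map[string]string{}
	// Server 1 happens to receive client 1's write first, then client 2's.
	apply(r1, [][2]string{{"x", "1"}, {"x", "2"}})
	// Server 2 receives the same two writes in the opposite order.
	apply(r2, [][2]string{{"x", "2"}, {"x", "1"}})
	fmt.Println(r1["x"], r2["x"]) // prints: 2 1 — readers see different values
}
```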
14:37 Because of that cost, there's a whole range of different solutions for getting better consistency, and a whole range of what people feel is an acceptable level of consistency and an acceptable set of anomalous behaviors that might be revealed. Any questions about this disastrous design? Okay — that's why we're talking about GFS: a lot of what GFS was doing is fixing this; it had better, though not perfect, behavior.

Okay, so where GFS came from. In 2003 — quite a while ago, actually — the web was certainly starting to be a very big deal and people were building big websites. In addition, there had been decades of research into distributed systems, and people knew, at least at the academic level, how to build all kinds of highly parallel, fault-tolerant systems, but there had been very little use of academic ideas in industry. Starting at around the time this paper was published, big websites like Google started to actually build serious distributed systems, and it was very exciting for people like me to see real uses of these ideas. Where Google was coming from was that they had some vast data sets, far larger than could be stored on a single disk: an entire crawled copy of the web; a little after this paper, giant YouTube videos; things like the intermediate files for building a search index; and apparently enormous log files from all their web servers, kept so they could analyze them later. So they had some big data sets, they needed many, many disks to store them, and they needed to be able to process them quickly with things like MapReduce — they needed high-speed parallel access to vast amounts of data.

Okay, so what were they looking for? One goal was just that the thing be big and fast. They also wanted a file system that was global, in the sense that many different applications could get at it. One way to build a big storage system is to have some particular application in mind and build storage dedicated and tailored to that application; if somebody else in the next office needs big storage, well, they can build their own thing. But if you have a universal, global, reusable storage system, then if I store a huge amount of data — say I'm crawling the web — and you want to look at my crawled web pages, then because we're all playing in the same sandbox, all using the same storage system, you can just read my files, access controls permitting. So the idea was to build a file system where anybody inside Google could name and read any of the files, to allow sharing. And in order to get bigness and fastness, they needed to split the data up: every file would be automatically split by GFS over many servers, so that writes and reads would automatically be fast, as long as you were reading from lots and lots of servers.
18:14 Reading a file from lots of clients gets you high aggregate throughput, and it also lets a single file be bigger than any single disk. And because we're building something out of hundreds of servers, we want automatic failure recovery: we don't want a system where, every time one of our hundreds of servers fails, some human has to go to the machine room and do something with that server to get it up and running or transfer its data — we want the system to just fix itself.

There were also some non-goals. One is that GFS was designed to run in a single data center, so we're not talking about placing replicas all over the world: a single GFS installation lived in one data center, one big machine room. Getting this style of system to work with replicas far distant from each other is a valuable goal, but difficult — so, single data center. Second, this is not a service to customers: GFS was for internal use by applications written by Google engineers. They weren't selling it directly; they might sell services that used GFS internally, but it was just for internal use. And it was tailored in a number of ways for big, sequential file reads and writes. There's a whole other domain of storage systems optimized for small pieces of data — a bank holding bank balances probably wants a database that can read and update 100-byte records holding people's balances — but GFS is not that system. It's really for big data — big meaning gigabytes, terabytes — with sequential, not random, access. It also has a certain batch flavor: there's not a huge amount of effort to make access very low latency; the focus is on throughput of big, multi-megabyte operations.

This paper was published at SOSP in 2003, the top academic systems conference. Usually the standard for papers at such conferences is a lot of very novel research, and this paper was not necessarily in that class: none of its specific ideas were particularly new at the time, and things like distribution, sharding, and fault tolerance were well understood. But the paper described a system that was really operating, in use, at a far larger scale — hundreds of thousands of machines — much bigger than anything academics had built. And the fact that it was used in industry and reflected real-world experience of what did and didn't work for deployed systems — systems that had to work and had to be cost-effective — was also extremely valuable. The paper also proposed the fairly heretical view that it was okay for a storage system to have pretty weak consistency. The academic mindset at that time was that a storage system really should have good behavior: what's the point of building systems that return the wrong data, like my terrible replication design? Why not build systems that return the right, correct data instead of incorrect data?
21:59 Now, this paper actually does not guarantee to return correct data, and the hope is that they take advantage of that in order to get better performance. A final thing that was interesting about this paper is its use of a single master. In an academic paper you'd probably have some fault-tolerant, replicated, automatically failure-recovering master, perhaps many masters with the work split among them; but this paper said, look, we can get away with a single master, and it worked fine. Cynically: who's going to notice, on the web, that some count is wrong? If you do a search on a search engine, are you going to notice that one of 20,000 items is missing from the results, or that they're in the wrong order? Probably not. So there was just much more tolerance for incorrect data in these kinds of systems than there would be in, say, a bank. That doesn't mean all data at websites can be wrong — if you're charging people for ad impressions, you'd better get the numbers right — but this isn't really about that. In addition, some of the ways in which GFS could serve up odd data could be compensated for in the applications: the paper says applications should accompany their data with checksums and clearly mark record boundaries, so that applications can recover from GFS serving them maybe not quite the right data.

All right, the general structure — this is just Figure 1 in the paper. We have a bunch of clients, hundreds of clients. We have one master, although there might be replicas of the master; the master keeps the mapping from file names to where to find the data — really two tables, which I'll get to. And then there are a bunch of chunk servers, maybe hundreds of them, each with perhaps one or two disks. The separation here is that the master is all about naming and knowing where the chunks are, and the chunk servers store the actual data. That's a nice aspect of the design: these two concerns are almost completely separated from each other and can be designed separately, with separate properties. The master knows about all the files; for every file, it keeps a list of chunk identifiers holding the successive pieces of that file. Each chunk is 64 megabytes. So if I have a gigabyte file, the master knows that maybe the first chunk is stored here, the second chunk there, the third chunk over there; if I want to read some part of the file, I ask the master which server holds that chunk, then go talk to that server and read the chunk — roughly speaking.

All right, more precisely: if we're going to talk about the consistency of the system and how it deals with faults, we need to know what the master actually stores, in a little more detail. So, the master data. It has two main tables that we care about. One table maps each file name to an array of chunk IDs, or chunk handles.
25:52 That just tells you what the identifiers of the chunks are — there's not much you can do yet with just a chunk identifier. But the master also has a second table that maps each chunk handle to a bunch of data about that chunk. One item is the list of chunk servers that hold replicas of that data: each chunk is stored on more than one chunk server, so this is a list of chunk servers. Every chunk also has a current version number, so the master remembers the version number for each chunk. All writes for a chunk have to be sequenced through the chunk's primary, which is one of the replicas, so the master remembers which chunk server is the primary. And that primary is only allowed to be primary for a certain lease time, so the master also remembers the expiration time of the lease.

This stuff, so far, is all in RAM in the master, so it would just be gone if the master crashed. In order to be able to reboot the master and not forget everything about the file system, the master actually stores all of this data on disk as well as in memory. Reads are served from memory, but writes — at least to the parts of this data that have to be reflected on disk — have to go to the disk. The way it actually manages that is that the master has a log on disk, and every time it changes the data it appends an entry to the log, plus it takes checkpoints.

Some of this stuff actually needs to be on disk and some doesn't. I'm guessing a little here, but certainly the array of chunk handles has to be on disk, so I'm going to write NV here, for non-volatile, meaning it has to be reflected on disk. The list of chunk servers, it turns out, doesn't, because if the master reboots it talks to all the chunk servers and asks them what chunks they have — so this, I imagine, is not written to disk. The version number: any guesses — written to disk, or not? It requires knowing how the system works; I'm going to vote written to disk, non-volatile, and we can argue about that later when we talk about how the system works. The identity of the primary: almost certainly not written to disk, so volatile. The reason is that if the master reboots and, this being volatile, forgets who the primary is for a chunk, it can simply wait out the 60-second lease expiration time; then it knows that absolutely no primary is still functioning for this chunk, and it can safely designate a different primary. Similarly, the lease expiration time is volatile.

So that means that whenever a file is extended with a new chunk — it grows past the next 64-megabyte boundary — or the version number changes because a new primary is designated, the master has to first append a little record to its log saying "I just added such-and-such a chunk to this file" or "I just changed the version number." Every time one of those changes, the master needs to write its disk. The paper doesn't talk about this much, but it limits the rate at which the master can change things.
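Here's how I'd sketch those two master tables in Go — my own rendering of the blackboard notes, not code from the paper — with comments marking which fields are non-volatile (logged to disk) and which are volatile (reconstructed after a reboot):

```go
package master

import "time"

type ChunkHandle uint64

// Table 1: file name -> ordered array of chunk handles. Non-volatile (logged).
type FileInfo struct {
	Chunks []ChunkHandle // the i'th handle covers bytes [i*64MB, (i+1)*64MB) of the file
}

// Table 2: chunk handle -> everything the master knows about that chunk.
type ChunkInfo struct {
	Servers     []string  // replicas: volatile, re-learned from chunk servers at reboot
	Version     uint64    // current version number: non-volatile, logged to disk
	Primary     string    // current lease holder: volatile
	LeaseExpire time.Time // lease expiration time: volatile
}

type Master struct {
	files  map[string]*FileInfo
	chunks map[ChunkHandle]*ChunkInfo
	// plus an append-only log and periodic checkpoints on disk, holding
	// the non-volatile parts of this state
}
```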
30:05 That's because you can only write your disk so many times per second. The reason for using a log, rather than a database — some sort of B-tree or hash table on disk — is that you can append to a log very efficiently: you can take a bunch of recent log records that need to be added and write them all in a single write, after a single rotation, to whatever point on the disk holds the end of the log file. Whereas if it were a B-tree reflecting the real structure of the data, you'd have to seek to a random place on the disk and do a little write there. So the log makes it a bit faster to reflect operations onto the disk. However, if the master crashes and has to reconstruct its state, you wouldn't want it to re-read its log from the beginning of time, from when the server was first installed a few years ago. So in addition, the master sometimes checkpoints its complete state to disk, which takes some amount of time — seconds, maybe a minute. When it restarts, it goes back to the most recent checkpoint and replays just the portion of the log starting at the point in time when that checkpoint was created. Any questions about the master data? Okay.

So with that in mind, I'm going to lay out the steps in a read and the steps in a write. Where all this is heading is that I then want to discuss, for each failure I can think of, whether the system acts correctly after that failure; but to do that we need to understand the data and the operations on the data.

Okay, a read. What a read means is that the application has a file name in mind and an offset in the file it wants to read some data from. The first step is that the client sends the file name and the offset to the master. The master looks up the file name in its file table; each chunk is 64 megabytes, so it can use the offset divided by 64 megabytes to find which chunk, and then it looks up that chunk in its chunk table, finds the list of chunk servers that have replicas of that data, and returns that list to the client. So step one: the client sends the file name and the offset to the master, and the master sends back the chunk handle — let's say H — and the list of servers. Now we have some choice: we can ask any one of those servers. The paper says clients try to guess which server is closest to them in the network, maybe in the same rack, and send the read request to that replica. The client actually caches this result, so that if it reads that chunk again — and indeed the client might read a given chunk in one-megabyte or 64-kilobyte pieces, so it may end up reading successive regions of the same chunk many times — it caches which chunk servers to talk to for a given chunk.
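Here's a sketch of that client-side read path in Go. The RPC names (Lookup, ReadChunk), the cache, and the "pick a replica" step are my own stand-ins for what the (unpublished) GFS client library presumably does:

```go
package gfsclient

import "fmt"

const chunkSize = 64 << 20 // 64 MB chunks

type ChunkHandle uint64

// Location is what the master returns: the chunk's handle plus the servers
// holding replicas of it.
type Location struct {
	Handle  ChunkHandle
	Servers []string
}

// Stand-ins for the master and chunk-server RPC interfaces.
type MasterAPI interface {
	Lookup(file string, chunkIndex int64) (Location, error)
}
type ChunkServerAPI interface {
	ReadChunk(h ChunkHandle, offset int64, length int) ([]byte, error)
}

type Client struct {
	master MasterAPI
	dial   func(server string) ChunkServerAPI
	cache  map[string]Location // remembered answers from the master
}

// Read returns `length` bytes of `file` starting at `offset`, assuming the
// range lies within a single chunk.
func (c *Client) Read(file string, offset int64, length int) ([]byte, error) {
	index := offset / chunkSize // which 64 MB chunk of the file holds this offset
	key := fmt.Sprintf("%s/%d", file, index)
	loc, ok := c.cache[key]
	if !ok {
		var err error
		loc, err = c.master.Lookup(file, index) // ask the master: handle + servers
		if err != nil {
			return nil, err
		}
		c.cache[key] = loc // cache so we don't keep beating on the master
	}
	// Pick a replica (the paper says clients prefer a nearby one) and read the
	// byte range within that chunk.
	srv := c.dial(loc.Servers[0])
	return srv.ReadChunk(loc.Handle, offset%chunkSize, length)
}
```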
34:07 That way the client doesn't have to keep beating on the master, asking it for the same information over and over. Then the client talks to one of the chunk servers, telling it the chunk handle and the offset. The chunk servers store each chunk in a separate Linux file on their hard drives, in an ordinary Linux file system, and presumably the chunk files are just named by the handle; so all the chunk server has to do is find the file with the right name — which holds that entire chunk — read the desired range of bytes out of it, and return the data to the client. Any questions about how reads operate?

Can I repeat step number one? Step one is: the application wants to read a particular file at a particular offset — a particular range of bytes in the file, say bytes 1,000 to 2,000. It sends the name of the file and the beginning of the byte range to the master, and the master looks up the file name in its file table to find the chunk that contains that byte range for that file.

Yes? So, I don't know the exact details, but my impression is that if the application wants to read more than 64 megabytes — or even just two bytes that happen to span a chunk boundary — then the library (the application is linked with a library that sends RPCs to the various servers) would notice that the read spans a chunk boundary and break it into two separate reads, and maybe talk to the master twice. It may be that you can talk to the master once and get two results, but logically at least it's two requests to the master, and then requests to two different chunk servers.

Yes? Well, at least initially the client doesn't know, for a given file, which chunk servers hold which chunks. It can calculate that it needs the seventeenth chunk, but then it needs to know which chunk server holds the seventeenth chunk of that file, and for that it certainly needs to talk to the master. I'm not going to make a strong claim about which of them decides that it's the seventeenth chunk of the file, but it's the master that finds the handle of the seventeenth chunk, looks it up in its table, and figures out which chunk servers hold that chunk.

Yes — you mean, if the client asks for a range of bytes that spans a chunk boundary? Well, the client is linked with this GFS library that knows how to take read requests apart and put them back together. That library would talk to the master, and the master would tell it: chunk seven is on this server and chunk eight is on that server. Then the library would say, I need the last couple of bytes of chunk seven and the first couple of bytes of chunk eight, fetch those, put them together in a buffer, and return them to the calling application. The master tells it about chunks, and the library figures out where in a given chunk to look for the data the application wanted. The application only thinks in terms of file names and plain offsets into the entire file.
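Continuing the illustrative client sketch above, a read that crosses a 64 MB chunk boundary could be split into one per-chunk read like this (again my own code, not the paper's):

```go
// ReadSpanning splits a read that may cross chunk boundaries into one read
// per chunk; each piece goes through Read above, and so may consult the
// master and a different chunk server.
func (c *Client) ReadSpanning(file string, offset int64, length int) ([]byte, error) {
	var out []byte
	for length > 0 {
		inChunk := int(chunkSize - offset%chunkSize) // bytes left in this chunk
		n := length
		if inChunk < n {
			n = inChunk
		}
		part, err := c.Read(file, offset, n)
		if err != nil {
			return nil, err
		}
		out = append(out, part...)
		offset += int64(n)
		length -= n
	}
	return out, nil
}
```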
38:38 The library and the master conspire to turn those file names and offsets into chunks.

Yeah — sorry, let me get closer; say that again? So the question is: does it matter which chunk server you read from? Yes and no. Notionally they're all supposed to be replicas; in fact, as you may have noticed, or as we'll talk about, they're not necessarily identical, and applications are supposed to be able to tolerate that. So you may get slightly different data depending on which replica you read. And yes, the paper says clients try to read from a chunk server that's in the same rack, or on the same switch, or something.

All right, that's reads. Writes are more complex and interesting. The application interface for writes is pretty similar: there's some call you make to the GFS client library saying, here's a file name and a range of bytes I'd like to write, and a buffer of data I'd like you to write to that range. Actually, let me backpedal: I only want to talk about record appends, so I'm going to phrase the client interface as: the client makes a library call that says, here's a file name, and I'd like to append this buffer of bytes to that file. This is the record append that the paper talks about.

So again the client asks the master: it sends the master a request saying, I'd like to append to this named file, please tell me where to find the last chunk in the file. The client may not know how long the file is: if lots of clients are appending to the same file — say some big file logging stuff from a lot of different clients — no client will necessarily know how long the file is, and therefore which offset, or which chunk, it should be appending to. So it asks the master: please tell me about the servers that hold the very last chunk — the current chunk — of this file.

Unfortunately, writing is different from reading. If you're reading, you can read from any up-to-date replica; for writing, there needs to be a primary. At this point the file may or may not have a primary already designated by the master, so we need to consider the case where there's no primary, and all the master knows is that there's no primary. So one case is: no primary. In that case the master needs to find out the set of chunk servers that have the most up-to-date copy of the chunk, because if you've been running the system for a long time, then due to failures or whatever, there may be chunk servers out there with old copies of the chunk — from yesterday, or last week — that haven't been kept up to date, maybe because that server was dead for a couple of days and wasn't receiving updates. So you need to be able to tell the difference between up-to-date copies of the chunk and non-up-to-date ones. The first step is to find the up-to-date replicas. This is all happening inside the master.
42:44 The client has told the master: I want to append to this file, please tell me which chunk servers to talk to — so this is part of the master figuring out which chunk servers the client should talk to. So we need to find up-to-date replicas, and what up-to-date means is: a replica whose version of the chunk equals the version number the master knows is the most up-to-date. It's the master that hands out these version numbers; the master remembers that, for this particular chunk, a chunk server is only up to date if it has, say, version 17. And this is why the version number has to be non-volatile, stored on disk: if it were lost in a crash and there were chunk servers holding stale copies of chunks, the master wouldn't be able to distinguish a chunk server holding a stale copy of a chunk from last week from a chunk server holding the copy that was up to date as of the crash. That's why the master remembers the version number on disk.

Yeah — what if you knew you were talking to all the chunk servers? Okay, so the observation is: the master has to talk to the chunk servers anyway when it reboots, in order to find out which chunk server holds which chunk, because the master doesn't remember that. So you might think you could just talk to the chunk servers, find out which chunks and versions they hold, and take the maximum version for a given chunk over all the responding chunk servers. That would work if all the chunk servers holding a chunk responded. But the risk is that, at the time the master reboots, some of the chunk servers may be offline, or disconnected, or themselves rebooting, and don't respond — so all the master gets back are responses from chunk servers that have last week's copy of the chunk, while the chunk servers that have the current copy haven't finished rebooting or are offline.

Oh, yes — if the servers holding the most recent copy are permanently dead, if you've lost all copies of the most recent version of a chunk, then yes. Okay, so the question is: the master knows that for this chunk it's looking for version 17 — it talks to the chunk servers periodically to ask them which chunks and which versions they have — and suppose it finds no chunk server with version 17 for this chunk. Then the master will either not respond yet and wait, or it will tell the client: I can't answer that, try again later. This would come up if, say, there was a power failure in the building and all the servers crashed and are slowly rebooting: the master might come up first, some fraction of the chunk servers might be up, and other ones will reboot five minutes from now. So the master has to be prepared to wait — and it will wait forever, because you don't want to use a stale version of a chunk.

Okay, so the master needs to assemble the list of chunk servers that have the most recent version, and the master knows the most recent version, stored on its disk.
46:16 Each chunk server, along with each chunk, as you pointed out, also remembers the version number of the chunk it stores, so that when chunk servers report in to the master saying "I have this chunk," the master can ignore the ones whose version doesn't match the version the master knows is most recent. So, remember where we were: the client wants to append, the master doesn't have a primary, and it figures out — maybe having to wait — the set of chunk servers that have the most recent version of the chunk. It picks a primary: one of them will be the primary and the others secondaries, chosen among the replicas with the most recent version. The master then increments the version number and writes that to disk, so it doesn't forget if it crashes. Then it sends the primary and the secondaries each a message saying: for this chunk, here's the primary, here are the secondaries — the recipient may be one of them — and here's the new version number. The primary and secondaries write the version number to disk so they don't forget, because if there's a power failure or whatever, they have to report to the master the actual version number they hold.

Yes? That's a great question, and I don't know — there are hints in the paper that I'm slightly wrong about this. The paper says that if the master reboots and talks to the chunk servers, and one of the chunk servers reports a version number higher than the version number the master remembers, the master assumes there was a failure while it was assigning a new primary and adopts the higher version number it heard from the chunk server. So it must be the case, in order to handle a master crash at this point, that the master writes its own version number to disk after telling the primaries. There's a bit of a problem here, though — is there an ACK? Maybe the master tells the primary and secondaries that they're the primary and secondaries, tells them the new version number, waits for the acknowledgment, and then writes to disk. But there's something unsatisfying about this; I don't believe that quite works either, because of the possibility that the chunk servers with the most recent version numbers are offline at the time the master reboots — we wouldn't want the master, which doesn't know the current version number, to just accept whatever highest version number it hears, which could be an old one. So this is an area of my ignorance: I don't really understand whether the master updates its version number on disk first and then tells the primary and secondaries, or the other way around, and I'm not sure it works either way. But in any case, one way or another, the master updates its version number and tells the primary and secondaries: you're the primary and secondaries, and here's the new version number. So now we have a primary.
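Putting those steps together, here's how I'd sketch the master's no-primary path, continuing the master structs sketched earlier. reportedVersion, appendToLog, and notifyRole are hypothetical helpers, and the lecture's open question about exactly when the version number hits the master's disk applies here too:

```go
// designatePrimary sketches what the master does when an append arrives and
// no primary currently holds the lease for chunk h.
func (m *Master) designatePrimary(h ChunkHandle) (primary string, ok bool) {
	ci := m.chunks[h]

	// 1. Collect the up-to-date replicas: those whose reported version equals
	//    the version the master remembers on disk. Stale replicas are ignored.
	//    (reportedVersion is an assumed helper, not shown.)
	var upToDate []string
	for _, s := range ci.Servers {
		if m.reportedVersion(s, h) == ci.Version {
			upToDate = append(upToDate, s)
		}
	}
	if len(upToDate) == 0 {
		return "", false // no up-to-date replica reachable yet: wait / tell the client to retry
	}

	// 2. Pick one up-to-date replica as primary; the rest are secondaries.
	ci.Primary = upToDate[0]

	// 3. Increment the version number and record it in the on-disk log
	//    (the exact ordering relative to step 4 is the open question above).
	ci.Version++
	m.appendToLog(h, ci.Version) // assumed helper

	// 4. Grant a 60-second lease and tell primary and secondaries their roles
	//    and the new version number, which they also write to their disks.
	ci.LeaseExpire = time.Now().Add(60 * time.Second)
	for _, s := range upToDate {
		m.notifyRole(s, h, ci.Primary, upToDate, ci.Version) // assumed helper
	}
	return ci.Primary, true
}
```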
50:19 The primary is now able to accept writes. That's the primary's job: to take writes from clients and organize applying those writes to the various chunk servers. The reason for the version number machinery is so that the master can recognize which servers hold the newest copy: the master hands out the ability to be primary for a chunk, and we want to be able to recognize — even if the master crashes — that it was that primary and its secondaries that were in charge of updating the chunk, and that only those are allowed to serve that chunk in the future. The way the master does this is with this version number logic.

The master also gives the primary a lease, which basically tells the primary: you're allowed to be primary for the next 60 seconds; after 60 seconds you have to stop. This is part of the machinery for making sure we don't end up with two primaries — I'll talk about it a bit later.

Okay, so now we have a primary, and the master tells the client who the primary and the secondaries are. At this point we're executing Figure 2 in the paper. The client now knows who the primary and secondaries are, and, in some order or another — the paper explains a clever way to manage this — the client sends a copy of the data it wants appended to the primary and all the secondaries. The primary and the secondaries write that data to a temporary location; it's not appended to the file yet. After they've all said "yes, we have the data," the client sends a message to the primary saying: you and all the secondaries have the data; I'd like to append it to this file. The primary may be receiving these requests from lots of different clients concurrently; it picks some order and executes the client requests one at a time. For each client append request, the primary looks at the offset at the current end of the current chunk, makes sure there's enough remaining space in the chunk, writes the client's record to the end of its own copy of the chunk, and tells all the secondaries to write the client's data at the same offset in their chunks. So the primary picks an offset, and all the replicas, including the primary, are told to write the newly appended record at that offset. The secondaries may do it, or they may not: maybe one ran out of space, maybe it crashed, maybe the network message from the primary was lost. If a secondary actually wrote the data to its disk at that offset, it replies yes to the primary. If the primary collects a yes answer from all of the secondaries — all of them managed to write it and replied "yes, I did it" — then the primary replies success to the client.
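Here's a sketch of that append path at the primary, in the same illustrative style; the Primary type, the writeAt helper, and the secondaries' WriteAt RPC are mine, not the paper's (and "sync" is assumed imported):

```go
// SecondaryAPI stands in for the RPC the primary uses to push data to a
// secondary at a specific offset.
type SecondaryAPI interface {
	WriteAt(h ChunkHandle, offset int64, data []byte) error
}

type Primary struct {
	mu          sync.Mutex
	chunkLen    map[ChunkHandle]int64          // current end of each chunk we're primary for
	secondaries map[ChunkHandle][]SecondaryAPI // the secondaries for each chunk
}

// writeAt appends data to the local Linux file holding chunk h (not shown).
func (p *Primary) writeAt(h ChunkHandle, off int64, data []byte) {}

// RecordAppend serializes appends, picks the offset at the current end of the
// chunk, applies locally, and only reports success if every secondary wrote
// the record at that same offset.
func (p *Primary) RecordAppend(h ChunkHandle, data []byte) (int64, bool) {
	p.mu.Lock() // appends to a chunk are executed one at a time
	defer p.mu.Unlock()

	off := p.chunkLen[h]
	if off+int64(len(data)) > chunkSize {
		return 0, false // not enough room: the client must retry on a fresh chunk
	}
	p.writeAt(h, off, data) // the primary appends to its own replica
	for _, s := range p.secondaries[h] {
		if err := s.WriteAt(h, off, data); err != nil {
			// One secondary failed or didn't answer: report failure. Replicas
			// now differ at this offset; the client will re-issue the append.
			return 0, false
		}
	}
	p.chunkLen[h] = off + int64(len(data))
	return off, true // success: every replica holds the record at `off`
}
```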
54:23 If the primary doesn't get an answer from one of the secondaries, or a secondary replies "sorry, something bad happened — I ran out of disk space, my disk died," whatever, then the primary replies no to the client. And the paper says that if the client gets an error like that back from the primary, the client is supposed to reissue the entire append sequence, starting again by talking to the master to find the chunk at the end of the file, and then reissuing the whole record append operation.

You would think so, but they don't. So the question is: gee, the primary tells all the replicas to do the append, and maybe some of them do it and some of them don't. Right — if some of them don't, then we reply an error to the client, so the client thinks the append didn't happen; but the replicas where the append succeeded did append it. So now we have replicas that don't hold the same data: the one that returned an error didn't do the append, and the ones that returned yes did. That is just the way GFS works.

Yeah — so if a reader then reads this file, then depending on which replica it reads from, it may or may not see the appended record. If the record append succeeded — if the client got a success message back — that means all of the replicas appended the record at the same offset. If the client gets a no back, then zero or more of the replicas may have appended the record at that offset and the others not; so depending on which replica you read from, you may or may not see the record.

Yes — all the replicas have the same version number; all the secondaries are at the same version. The version number only changes when the master assigns a new primary, which would ordinarily only happen if the primary failed. So what we're talking about is replicas that all have the fresh version number; you can't tell from the version that the replicas are different — but maybe they are. The justification is this: yes, maybe the replicas don't all have the appended record, but that's the case in which the primary answered no to the client, so the client knows the write failed. The reasoning is that the client library will then reissue the append, so the appended record will eventually show up: the append will eventually succeed, you'd think, because the client will keep reissuing it until it does. And when it succeeds, that means there's some offset farther on in the file where the record occurs in all the replicas, as well as preceding offsets where it occurs in only some of them.

Yes? Oh, this is a great question. The exact path that the write data takes might be quite important with respect to the underlying network.
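The client-side retry rule, sketched in the same illustrative style (askMasterForLastChunk and the primary stub are hypothetical helpers, not the paper's interface); this loop is why a record that eventually succeeds everywhere may also appear as a duplicate, or be missing from some replicas at earlier offsets:

```go
// Append re-runs the whole sequence on any error: ask the master again for
// the last chunk and its primary, then ask the primary to do the append.
func (c *Client) Append(file string, record []byte) (int64, error) {
	for {
		primary, handle, err := c.askMasterForLastChunk(file) // assumed helper
		if err != nil {
			continue // e.g. the master is still waiting for an up-to-date replica
		}
		if off, ok := primary.RecordAppend(handle, record); ok {
			return off, nil // all replicas now hold the record at `off`
		}
		// Failure: some replicas may already hold the record at the offset the
		// primary chose, others not; just retry, which can produce duplicates.
	}
}
```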
58:29 When the paper first talks about it, it claims the client sends the data to each replica; in fact, later on it changes its tune and says the client sends the data only to the closest of the replicas, and that replica then forwards it to another replica, along a sort of chain, until all the replicas have the data. The path of that chain is chosen to minimize crossing bottleneck inter-switch links in the data center.

Yes? The version number only gets incremented if the master thinks there's no primary. In the ordinary sequence there'd already be a primary for that chunk: the master remembers, oh, there's already a primary and secondaries for this chunk, so it won't go through this primary selection and won't increment the version number — it'll just tell the client, here's the primary, with no version number change.

My understanding — and I think you're asking an interesting question — is this: in the scenario where the primary has answered failure to the client, you might think something must be wrong somewhere and that it should be fixed before you proceed. In fact, as far as I can tell from the paper, there's nothing immediate: the client retries the append, because maybe the problem was just that a network message got lost, in which case there's nothing to repair — the message was lost, it should be retransmitted, and this is a sort of complicated way of retransmitting the network message. Maybe that's the most common kind of failure, and in that case nothing changes: it's still the same primary and the same secondaries, the client retries, and maybe this time it works because the network doesn't discard a message. It's an interesting question, though: if what went wrong is a serious error or fault in one of the secondaries, what we'd like is for the master to reconfigure that set of replicas to drop the secondary that's not working. Because it would be choosing a new primary, executing this code path, the master would then increment the version, and we'd have a new primary and new working secondaries with the new version, and this not-so-great secondary with an old version and a stale copy of the data — and because it has an old version, the master will never mistake it for being fresh. But there's no evidence in the paper that this happens immediately; as far as what's said in the paper, the client just retries and hopes it works later. Eventually, if the secondary is dead, the master — which does ping all the chunk servers — will realize that, and will probably then change the set of primary and secondaries and increment the version, but only later.

The lease is the answer to the question: what if the master thinks the primary is dead because it can't reach it? Suppose we're in a situation where at some point the master said "you're the primary," and the master pings all the servers periodically to see if they're alive, because if they're dead it wants to pick a new primary. The master sends some pings to the primary.
62:07 So suppose you're the primary, and you don't respond to my pings. You would think that at that point — gosh, you're not responding — the master would designate a new primary. It turns out that, by itself, that is a mistake. The reason it's a mistake to use that simple design is that I may be pinging you, and the reason I'm not getting responses is that something is wrong with the network between me and you. So there's a possibility that you're alive — you're the primary, you're alive, I'm pinging you, the network is dropping the packets — but you can still talk to other clients, and you're serving requests from them. If I, the master, designated a new primary for that chunk, now we'd have two primaries processing writes on two different copies of the data, and we'd have totally diverging copies of the data. That error — having two primaries, or whatever, processing requests without knowing about each other — is called split brain. I'm writing this on the board because it's an important idea and it'll come up again. It's usually said to be caused by network partition: some network error in which the master can't talk to the primary but the primary can talk to clients, a partial network failure. These are some of the hardest problems to deal with in building these kinds of storage systems.

Okay, so the problem is that we want to rule out the possibility of mistakenly designating two primaries for the same chunk. The way the master achieves that is that, when it designates a primary, it gives the primary a lease, which is basically the right to be primary until a certain time. The master remembers how long the lease lasts, and the primary knows how long its lease lasts. If the lease expires, the primary knows it, and will simply stop executing client requests — it will ignore or reject client requests after the lease has expired. Therefore, if the master can't talk to the primary and would like to designate a new primary, it must wait for the previous primary's lease to expire. That means the master is going to sit on its hands for one lease period, 60 seconds; after that, it's guaranteed the old primary has stopped operating as primary, and the master can safely designate a new primary without producing this terrible split-brain situation.

So the question is: why is designating a new primary bad, since the clients always ask the master first, and if the master changes its mind it will direct subsequent clients to the new primary? Well, one reason is that, for efficiency, the clients cache the identity of the primary, at least for short periods of time. But even if they didn't, the bad sequence is this: I'm the master; you ask me who the primary is; I send you a message saying the primary is server one, and that message is in flight in the network. Then I, the master, decide that primary has failed, I designate a new primary, I send that server a message saying "you're the primary," and I start telling other clients who ask that that one over there is the primary — all while the message to you is still in flight. You receive the message saying the old primary is the primary; you think, gosh, I just got this from the master, I'm going to go talk to that primary; and without some much more clever scheme, there's no way you could realize that, even though you just got this information from the master, it's already out of date. And if that old primary serves your modification requests and responds success to you, then we have two conflicting replicas.
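In the same illustrative style, the lease rule boils down to two symmetric checks — the primary stops serving once its lease has run out, and the master won't pick a replacement until that same lease has run out — so two primaries can never be accepting writes at once (the names are mine, not the paper's):

```go
// At the primary: checked before serving any client mutation; once the lease
// has expired, client requests are rejected.
func leaseValid(leaseExpire time.Time) bool {
	return time.Now().Before(leaseExpire)
}

// At the master: called when the primary has stopped answering pings.
func (m *Master) maybeReplacePrimary(h ChunkHandle) {
	ci := m.chunks[h]
	if time.Now().Before(ci.LeaseExpire) {
		return // sit on our hands until the old primary's lease has run out
	}
	// Now it's guaranteed the old primary has stopped serving requests, so it
	// is safe to pick a new primary (incrementing the version, as sketched above).
	m.designatePrimary(h)
}
```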
65:02 Oh, so the question is why is designating 65:14 a new primary bad, since the clients 65:15 always ask the master first, and so if the 65:18 master changes its mind then 65:20 it will direct subsequent clients to the 65:22 new primary. Well, one reason is that 65:26 for efficiency the 65:28 clients cache the identity of the primary, at 65:31 least for short periods of time. Even if 65:34 they didn't, though, the bad sequence is 65:37 this: I'm the master, you ask me 65:40 who the primary is, I send you a message 65:43 saying the primary is server one, 65:46 and that message is in flight in the 65:47 network. And then I, the master, 65:50 think somebody has failed, 65:52 I think that primary has failed, I 65:53 designate a new primary, and I send the new 65:55 primary a message saying you're the 65:56 primary, and I start answering other 65:57 clients who ask who the primary is by saying 66:00 that one over there is the primary, 66:01 while the message to you is still in 66:03 flight. You receive the message saying 66:04 the old primary is the primary, and you think, 66:07 gosh, I just got this from the master, I'm 66:10 gonna go talk to that primary, and 66:11 without some much more clever scheme 66:13 there's no way you could realize that 66:14 even though you just got this 66:16 information from the master, it's already 66:19 out of date. And if that old primary serves 66:21 your modification requests 66:24 and responds success to you, 66:27 then we have two conflicting replicas. 66:35 Yes? 66:41 ...again, you have a new file and no replicas? 66:50 Okay, so if you have a new file and no 66:53 replicas, or even an existing file and no 66:55 replicas, you'll take the path I drew 66:58 on the blackboard: the master will 67:00 receive a request from a client saying, 67:02 oh, I'd like to append to this file, and 67:04 then I guess the master will first 67:06 see there are no chunks associated with 67:08 that file, and it will just make up a new 67:11 chunk identifier, perhaps by calling 67:13 the random number generator, and then 67:15 it'll look in its chunk information 67:17 table and see, gosh, I don't have any 67:20 information about that chunk, and it'll 67:22 make up a new record. There must 67:24 be special-case code where it says, well, 67:26 I don't know any version number, this 67:28 chunk doesn't exist, I'm just gonna make 67:30 up a new version number of one, pick a 67:32 random primary and set of secondaries, 67:35 and tell them, look, you are responsible 67:37 for this new empty chunk, please get to 67:40 work. The paper says three replicas per 67:47 chunk by default, so typically a primary 67:50 and two backups.
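To make that special-case path a bit more concrete, here's a hypothetical sketch (invented names and layout, not Google's code) of a master creating the first chunk for a file: invent a chunk handle, start the version at 1, and pick a primary plus two secondaries, matching the paper's default of three replicas:

```go
// Hypothetical sketch of the master creating the first chunk for a file.
// Names and data layout are invented for illustration.
package main

import (
	"fmt"
	"math/rand"
)

type chunkInfo struct {
	version     uint64
	primary     string
	secondaries []string
}

type master struct {
	files   map[string][]string   // file name -> list of chunk handles
	chunks  map[string]*chunkInfo // chunk handle -> replica info
	servers []string              // known chunkservers
}

// lastChunkForAppend returns the chunk a client should append to, creating
// the file's first chunk if none exists yet.
func (m *master) lastChunkForAppend(file string) (string, *chunkInfo) {
	handles := m.files[file]
	if len(handles) == 0 {
		// No chunks yet: invent a chunk handle and make up a fresh record.
		h := fmt.Sprintf("chunk-%d", rand.Int63())
		picks := rand.Perm(len(m.servers))[:3] // assumes at least 3 chunkservers
		ci := &chunkInfo{
			version:     1, // brand-new chunk starts at version 1
			primary:     m.servers[picks[0]],
			secondaries: []string{m.servers[picks[1]], m.servers[picks[2]]},
		}
		m.files[file] = append(handles, h)
		m.chunks[h] = ci
		return h, ci
	}
	h := handles[len(handles)-1]
	return h, m.chunks[h]
}

func main() {
	m := &master{
		files:   map[string][]string{},
		chunks:  map[string]*chunkInfo{},
		servers: []string{"s1", "s2", "s3", "s4"},
	}
	h, ci := m.lastChunkForAppend("/logs/web")
	fmt.Println(h, ci.primary, ci.secondaries, ci.version)
}
```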
68:03 Okay, so maybe the most 68:13 important thing here is just to repeat 68:16 the discussion we had a few minutes ago 68:21 about the intentional construction of GFS around 68:32 these record appends: if we 68:33 have three replicas, 68:41 maybe a client sends in a 68:43 record append for record A, and all three 68:46 replicas, the primary and both of the 68:49 secondaries, successfully append the data to 68:52 their chunks, and maybe the first record in 68:54 the chunk is A in that case, and 68:55 they all agree because they all did it. 68:57 Supposing another client comes in and says, 69:00 look, I want to append record B, but the 69:03 message to one of the replicas is lost in 69:06 the network, or whatever, suppose that replica drops the 69:08 message by mistake, but the other two 69:11 replicas get the message, one of 69:13 them being the primary and the other a 69:14 secondary, and they both append B to the file. 69:16 So now what we have is two replicas 69:19 that have B and one that doesn't have 69:21 anything there. And then maybe a third client 69:26 wants to append C, and remember 69:29 that the primary 69:30 picks the offset, so the primary is just 69:32 gonna tell the secondaries, look, 69:35 write record C at this offset in the 69:38 chunk, and they all write C there. Now, the rule for a 69:43 client 69:45 that gets an error 69:47 back from its request is that it will 69:50 resend the request. So now the client 69:53 that asked to append record B will ask 69:56 again to append record B, and this time 69:57 maybe there are no network losses and all 70:00 three replicas append record B. 70:05 And they're all alive, they all 70:07 have the freshest version number. And 70:09 now if a client reads, 70:13 what it sees depends on which 70:17 replica it looks at: it's gonna see, in 70:20 total, all three of the records, but it'll 70:22 see them in different orders depending on 70:25 which replica it reads. One replica will have 70:28 A, B, C and then a repeat of B; if it 70:31 reads this replica it'll see B and then 70:33 C; if it reads this replica it'll see A, 70:36 and then a blank space in the file, 70:39 padding, and then C and then B. So if you 70:41 read here you see C then B, and if you read 70:44 here you see B and then C, so different 70:47 readers will see different results. And 70:49 maybe the worst situation is that some 70:52 client gets an error back from the 70:54 primary because one of the secondaries 70:58 failed to do the append, and then the 71:00 client dies before resending the 71:02 request. Then you might get a 71:04 situation where you have record D 71:07 showing up in some of the replicas and 71:11 completely not showing up in 71:13 the other replicas. So under 71:16 this scheme we have good properties for 71:19 appends that the primary sent back a 71:23 successful answer for, and not so 71:26 great properties for appends where the 71:29 primary sent back a failure; those 71:32 records may just be absolutely 71:35 different across the different sets of replicas. 71:37 Yes? 71:44 My reading of the paper is that the 71:46 client starts at the very beginning of 71:49 the process and asks the master again 71:51 what's the last chunk in this file, 71:54 because it might have 71:55 changed if other people are appending to 71:56 the file. Yes.
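Here's a minimal sketch of that client-side retry behavior, with invented function names standing in for the master and primary RPCs; the point is only that a failed append is retried from the top, which is exactly how a record can end up appended more than once on some replicas:

```go
// Hypothetical client-side sketch of GFS record append with retry
// (invented function names; not the real GFS client library).
package main

import (
	"errors"
	"fmt"
)

var attempts int

// Stand-in for an RPC to the master: which chunk is last, and who is primary?
func askMasterForLastChunk(file string) (chunk, primary string) { return "chunk-7", "s1" }

// Stand-in for the append RPC to the primary chunkserver.
func appendViaPrimary(primary, chunk string, data []byte) (int64, error) {
	attempts++
	if attempts == 1 {
		// Pretend a secondary missed the message, so the primary reports failure
		// even though some replicas may already have appended the record.
		return 0, errors.New("a secondary failed to apply the append")
	}
	return 1024, nil // second attempt: primary picked offset 1024 on all replicas
}

// recordAppend retries until the primary reports success; each attempt starts
// over by re-asking the master for the file's last chunk and its primary.
func recordAppend(file string, data []byte) int64 {
	for {
		chunk, primary := askMasterForLastChunk(file)
		off, err := appendViaPrimary(primary, chunk, data)
		if err == nil {
			return off // at-least-once: duplicates and holes may exist below this offset
		}
		fmt.Println("append failed, retrying:", err)
	}
}

func main() {
	off := recordAppend("/logs/web", []byte("record B"))
	fmt.Println("record landed at offset", off)
}
```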
72:17 So, I can't read the 72:20 designers' minds. The observation is that the 72:22 system could have been designed to keep 72:24 the replicas in precise sync; that's 72:27 absolutely true, and you will do it in 72:30 labs 2 and 3: you are going to 72:33 design a system that does replication 72:34 and actually keeps the replicas in sync, 72:36 and you'll learn there are 72:38 various techniques, various things you 72:41 have to do in order to do that. One 72:43 of them is that there just has to be 72:46 this rule, if you want the replicas to 72:47 stay in sync: 72:50 you can't have these partial operations 72:53 that are applied to only some replicas and not 72:54 others, and that means there has to 72:56 be some mechanism where the 72:58 system, even if the client dies, 73:00 says no, wait a minute, there 73:01 was this operation and I haven't finished it 73:04 yet. So you build systems in which the 73:07 primary actually makes sure the backups 73:11 get every message.

73:29 So the question is: if the first write of B failed, shouldn't 73:34 C go where B was? 73:37 Well, it doesn't. You may think it should, 73:40 but the way the system actually operates 73:42 is that the primary will add C to the 73:46 end of the chunk, after B. 73:57 I mean, one reason for this is that at the 73:59 time the write for C comes in, the 74:01 primary may not actually know what the 74:03 fate of B was, because there may be multiple 74:05 clients submitting appends concurrently, 74:07 and for high performance you 74:10 want the primary to start the append for 74:14 B first and then, as soon as it can, get 74:17 the next one started and tell everybody to do 74:20 C, so that all this stuff happens in 74:21 parallel. By slowing it down, 74:25 the primary could sort of 74:31 decide that B totally failed and then 74:33 send another round of messages saying 74:35 please undo the write of B, but that would 74:39 be more complex and slower. 74:43 Again, the justification for this is 74:45 that the design is pretty simple. It 74:48 reveals some odd things to 74:53 applications, and the hope was that 74:58 applications could be relatively easily 74:59 written to tolerate records being in 75:01 different orders or who knows what, or, if 75:04 they couldn't, that applications could 75:08 either make their own arrangements for 75:11 picking an order themselves, by writing 75:13 sequence numbers in the files 75:14 or something, or, if an 75:17 application really was very sensitive to 75:20 order, you could just not have concurrent 75:21 appends from different clients to the 75:24 same file. For 75:27 files where order is very 75:29 important, say it's a movie file and you 75:31 don't want to scramble the 75:32 bytes in a movie file, you just write the 75:35 movie to the 75:37 file from one client in sequential order, 75:40 and not with concurrent record appends. 75:49 Okay, all right.
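As one illustration of that application-level workaround, here's a small sketch, with an invented record format that is not from the paper, of a reader dropping duplicates and re-establishing an order using per-writer sequence numbers embedded in the records:

```go
// Hypothetical application-level fix-up for GFS record-append semantics:
// each writer tags records with its own ID and sequence number, so readers
// can discard duplicates and sort records back into an order. The record
// format is invented for illustration only.
package main

import (
	"fmt"
	"sort"
)

type record struct {
	Writer string // which client wrote this record
	Seq    int    // per-writer sequence number
	Data   string
}

// dedupAndOrder drops duplicate (writer, seq) pairs and sorts what's left.
// In real use the records would be parsed out of the appended file data.
func dedupAndOrder(recs []record) []record {
	type key struct {
		writer string
		seq    int
	}
	seen := map[key]bool{}
	var out []record
	for _, r := range recs {
		k := key{r.Writer, r.Seq}
		if seen[k] {
			continue // duplicate caused by a client retry
		}
		seen[k] = true
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Writer != out[j].Writer {
			return out[i].Writer < out[j].Writer
		}
		return out[i].Seq < out[j].Seq
	})
	return out
}

func main() {
	// What one replica might contain: B was retried, so it appears twice.
	replica := []record{
		{"c1", 1, "A"}, {"c2", 1, "B"}, {"c3", 1, "C"}, {"c2", 1, "B"},
	}
	for _, r := range dedupAndOrder(replica) {
		fmt.Println(r.Writer, r.Seq, r.Data)
	}
}
```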
75:56 Somebody asked, basically, what would 76:04 it take to turn this design into one 76:06 which actually provided strong 76:08 consistency, consistency closer to our 76:11 single-server model where 76:13 there are no surprises. I don't actually 76:18 know, because that requires an 76:20 entirely new, complex design; it's not clear 76:22 how to mutate GFS into that design. But 76:24 I can list for you some 76:26 things that you would want to think 76:27 about if you wanted to upgrade GFS to a 76:32 system that did have strong consistency. 76:34 One is that you probably need the 76:37 primary to detect duplicate requests, so 76:40 that when the second B comes in, the 76:43 primary is aware that, oh, actually 76:44 we already saw that request earlier 76:47 and did it or didn't do it, and tries to 76:50 make sure that B doesn't show up twice 76:52 in the file. So one thing is, you're gonna need 76:54 duplicate detection; there's a small sketch of that idea after this discussion. Another issue is that 76:59 if a secondary is acting as a 77:02 secondary, you really need to design the 77:05 system so that if the primary tells a 77:06 secondary to do something, 77:08 the secondary actually does it and 77:10 doesn't just return an error. For a 77:12 strictly consistent system, having the 77:15 secondaries be able to just blow 77:16 off primary requests with no 77:20 compensation is not okay; so I think the 77:24 secondaries have to accept requests and 77:25 execute them, or, if a secondary has some 77:28 sort of permanent damage, like its disk 77:30 got unplugged by mistake, you need 77:32 to have a mechanism to take the 77:34 secondary out of the system so the 77:36 primary can proceed with the remaining 77:39 secondaries. But GFS kind of doesn't do 77:41 either, at least not right away. 77:45 And that also means that when the 77:49 primary asks secondaries to append 77:50 something, the secondaries have to be 77:52 careful not to expose that data to 77:54 readers until the primary is sure that 77:57 all the secondaries really will be able 77:59 to execute the append. So you might need 78:02 multiple phases in the writes: a 78:05 first phase in which the primary asks 78:06 the secondaries, look, I'd really 78:09 like you to do this operation, can you do 78:11 it, but don't actually do it yet, 78:13 and if all the secondaries answer with a 78:15 promise to be able to do the operation, 78:17 only then the primary says, all right, 78:20 everybody go ahead and do that operation 78:22 you promised. That's 78:24 the way a lot of real-world 78:27 strongly consistent systems work, and that 78:28 trick is called two-phase commit; a sketch of that pattern also appears after this discussion. 78:32 Another issue is that if the primary 78:34 crashes, there will have been some last 78:38 set of operations that the primary had 78:40 started sending to the secondaries, but 78:44 the primary crashed before it was sure 78:46 whether all the secondaries got 78:48 their copy of the operation or not. So if 78:51 the primary crashes, a new 78:54 primary, one of the secondaries, is going 78:56 to take over as primary, but at that 78:57 point the new primary and the 79:01 remaining secondaries may differ in the 79:03 last few operations, because maybe some 79:05 of them didn't get the message before 79:07 the primary crashed, and so the new 79:09 primary has to start by explicitly 79:11 resynchronizing with the secondaries to 79:15 make sure that the tails of 79:17 their operation histories are the same. 79:21 Finally, to deal with the problem that 79:24 there may be times when the 79:25 secondaries differ, or the client may 79:28 have a slightly stale indication from 79:31 the master of which secondary to talk to, 79:33 the system either needs to send all 79:35 client reads through the primary, because 79:38 only the primary is likely to know which 79:41 operations have really happened, or we 79:43 need a lease system for the secondaries, 79:45 just like we have for the primary, so 79:47 that it's well understood when a 79:50 secondary can and can't legally respond to 79:55 a client. So these are the things I'm 79:56 aware of that would have to be fixed in 79:58 this system, at the cost of added complexity and 80:00 chitchat, to make it have strong 80:02 consistency. And actually the way 80:05 I got that list was by thinking about 80:08 the labs: you're gonna end up doing all 80:09 the things I just talked about as part 80:12 of labs two and three to build a 80:13 strictly consistent system.
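Here's a tiny sketch of the duplicate-detection idea, with invented names and a simplified in-memory chunk; it only shows the bookkeeping, not how a real system would persist or garbage-collect the table:

```go
// Hypothetical sketch of primary-side duplicate detection (not in GFS): the
// client tags each append with a unique request ID, and the primary remembers
// which IDs it has already applied, so a retried append is not applied twice.
package main

import "fmt"

type primary struct {
	chunk   []string       // the appended records, in order
	applied map[string]int // request ID -> offset where it was appended
}

// appendOnce applies each request ID at most once and returns its offset.
func (p *primary) appendOnce(reqID, data string) int {
	if off, ok := p.applied[reqID]; ok {
		return off // duplicate retry: report the original offset again
	}
	off := len(p.chunk)
	p.chunk = append(p.chunk, data)
	p.applied[reqID] = off
	return off
}

func main() {
	p := &primary{applied: map[string]int{}}
	fmt.Println(p.appendOnce("c2-req1", "B")) // 0
	fmt.Println(p.appendOnce("c3-req1", "C")) // 1
	fmt.Println(p.appendOnce("c2-req1", "B")) // retry: still 0, B not duplicated
}
```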
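And here's a minimal sketch of the two-phase pattern just described, again with invented names and in-memory calls standing in for RPCs; a real protocol would also need logging, timeouts, and crash recovery:

```go
// Minimal sketch of the two-phase idea: the primary first asks every
// secondary to promise it can apply an operation, and only if all of them
// promise does it tell them to actually apply it (and expose it to readers).
package main

import "fmt"

type secondary struct {
	name     string
	prepared map[string]string // opID -> data promised but not yet applied
	applied  []string          // data actually visible to readers
}

func (s *secondary) prepare(opID, data string) bool {
	// A real secondary would check disk space, write a log entry, etc.
	s.prepared[opID] = data
	return true
}

func (s *secondary) commit(opID string) {
	s.applied = append(s.applied, s.prepared[opID])
	delete(s.prepared, opID)
}

// twoPhaseAppend returns true only if every secondary promised and then applied.
func twoPhaseAppend(secs []*secondary, opID, data string) bool {
	for _, s := range secs {
		if !s.prepare(opID, data) {
			return false // phase 1 failed: nobody has exposed the data yet
		}
	}
	for _, s := range secs {
		s.commit(opID) // phase 2: everyone applies the promised operation
	}
	return true
}

func main() {
	secs := []*secondary{
		{name: "s2", prepared: map[string]string{}},
		{name: "s3", prepared: map[string]string{}},
	}
	ok := twoPhaseAppend(secs, "op-1", "record B")
	fmt.Println("committed:", ok, "s2 sees:", secs[0].applied)
}
```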
80:13 Okay, so let 80:18 me spend one minute on, actually, 80:21 I have a link in the notes to a sort of 80:23 retrospective interview about how well 80:25 GFS played out over the first five or 80:28 ten years of its life at Google. The 80:32 high-level summary is 80:36 that it was tremendously successful: 80:37 many, many Google applications used it, and 80:40 a lot of Google infrastructure was 80:43 built as a layer on top of it; for 80:45 example BigTable was built as a 80:47 layer on top of GFS, and MapReduce also, 80:50 so it was widely used within Google. Maybe the 80:54 most serious limitation is that there 80:57 was a single master, and the master had 80:59 to have a table entry for every file and 81:01 every chunk, and that meant that as GFS 81:04 use grew and there were more and more 81:06 files, the master just ran out of memory, 81:08 ran out of RAM, to store the file information. 81:11 You can put more RAM in, but 81:13 there are limits to how much RAM a single 81:15 machine can have, and so that was 81:18 the most immediate problem 81:19 people ran into. In addition, the load on 81:24 a single master from thousands of 81:25 clients started to be too much; the 81:28 master could only 81:29 process however many hundreds of 81:30 requests per second, especially since it writes 81:33 things to disk, and pretty soon there got 81:35 to be too many clients. Another problem was that 81:39 some applications found it hard 81:41 to deal with these kinds of odd 81:44 semantics. And a final problem is that 81:47 there was not an automatic 81:49 story for master failover 81:52 in the original GFS paper as we 81:54 read it; it required human intervention 81:56 to deal with a master that had sort of 81:59 permanently crashed and needed to be 82:00 replaced, and that could take tens of 82:03 minutes or more, which was just too long for 82:05 failure recovery for some applications. 82:09 Okay, excellent, I'll see you on Thursday 82:13 and we'll hear more about all these 82:15 themes over the semester.