Transcript

All right, let's get started. This is 6.824, Distributed Systems. I'd like to start with a brief explanation of what I think a distributed system is. The core of it is a set of cooperating computers that communicate with each other over a network to get some coherent task done. The kinds of examples we'll focus on in this class are things like storage for big websites, big-data computations such as MapReduce, and somewhat more exotic things like peer-to-peer file sharing. Those are just examples of the kinds of case studies we'll look at. The reason all this is important is that a lot of critical infrastructure out there is built out of distributed systems: infrastructure that requires more than one computer to get its job done, or that inherently needs to be spread out physically.

So, the reasons why people build this stuff. First of all, before I even talk about distributed systems, let me remind you that if you're designing a system to solve some problem and you can possibly solve it on a single computer, without building a distributed system, you should do it that way. There are many, many jobs you can get done on a single computer, and it's always easier. You should try everything else before you try building a distributed system, because distributed systems are not simpler.

The reasons people are driven to use lots of cooperating computers are these. They need high performance, and the way to think about that is that they want to achieve some sort of parallelism: lots of CPUs, lots of memories, lots of disk arms moving in parallel. Another reason people build this stuff is to tolerate faults: have two computers do the exact same thing, and if one of them fails you can cut over to the other one. Another is that some problems are just naturally spread out in space. Say you want to do interbank transfers of money: bank A has a computer in New York City and bank B has a computer in London, and you simply have to have some way for them to talk to each other and cooperate in order to carry that out. So there are systems that are inherently physically distributed. The final reason people build this stuff is to achieve some sort of security goal. Often there's some code you don't trust, or you need to interact with somebody who may be malicious, or maybe their code has bugs in it, so you don't want to have to trust it. You may want to split up the computation so that your stuff runs over there on that computer, my stuff runs here on this computer, and they only talk to each other through some narrowly defined network protocol. If we're worried about security, that isolation is achieved by splitting things up onto multiple computers.
Most of this course is going to be about performance and fault tolerance, although the other two concerns often work their way in as constraints on the case studies we'll look at.

All these distributed systems are hard, and the problems come from a few places. Because they have many parts, and the parts execute concurrently on multiple computers, you get all the problems that come up with concurrent programming: complex interactions and weird timing-dependent behavior. That's part of what makes distributed systems hard. Another thing that makes them hard is that, because you have multiple pieces plus a network, you can have very unexpected failure patterns. If you have a single computer, it's usually the case that either the computer works, or it crashes or suffers a power failure; it pretty much either works or it doesn't. In a distributed system made up of lots of computers you can have partial failures: some pieces stop working while other pieces continue working, or maybe the computers are all working but some part of the network is broken or unreliable. Partial failure is another reason distributed systems are hard. A final reason is that the original motivation for building a distributed system is often to get higher performance, a thousand computers' worth of performance or a thousand disk arms' worth of performance, but it's actually very tricky to obtain that thousand-times speedup with a thousand computers. There are often a lot of roadblocks thrown in your way, so it takes a fair bit of careful design to make the system actually give you the performance you feel you deserve. Solving these problems, of course, is going to be all about addressing these issues.

A reason to take the course is that the problems and the solutions are often quite technically interesting. They're hard problems; for some of them there are pretty good solutions known, and for others there are not such great solutions. Distributed systems are also used by a lot of real-world systems out there: big websites often involve vast numbers of computers put together as distributed systems. When I first started teaching this course, distributed systems were something of an academic curiosity. People thought they were used sometimes, at a small scale, and felt that someday they might be important. But now, particularly driven by the rise of giant websites with vast amounts of data and entire warehouses full of computers, distributed systems have in the last twenty years become a very seriously important part of computing infrastructure. That means a lot of attention has been paid to them and a lot of problems have been solved, but there are still quite a few unsolved problems. So if you're a graduate student, or you're interested in research, there are a lot of problems yet to be solved in distributed systems that you could look into as research.
Finally, if you like building stuff, this is a good class, because it has a lab sequence in which you'll construct some fairly realistic distributed systems, focused on performance and fault tolerance, so you'll get a lot of practice building distributed systems and making them work.

All right, let me talk about course structure a bit before I get started on real technical content. You should be able to find the course website using Google. On the course website are the lab assignments and the course schedule, and also a link to a Piazza page where you can post questions and get answers. As for course staff: I'm Robert Morris, and I'll be giving the lectures. I also have four TAs (you guys want to stand up and show your faces?). The TAs are experts at, in particular, solving the labs. They'll be holding office hours, so if you have questions about the labs you should go to office hours, or you can post questions to Piazza.

The course has a couple of important components. One is the lectures. There's a paper for almost every lecture. There are two exams. There are the programming labs. And there's an optional final project that you can do instead of one of the labs.

The lectures will be about the big ideas in distributed systems, and there will also be a couple of lectures that are more about lab programming. A lot of the lectures will be taken up by case studies: a lot of the way I try to bring out the content of distributed systems is by looking at papers, some academic, some written by people in industry, describing real solutions to real problems. The lectures will be videotaped and I'm hoping to post them online, so if you're not here, or you want to review the lectures, you'll be able to look at the videotaped lectures.

The papers: again, there's one to read per week, most of them research papers. Some are classic papers, like today's paper on MapReduce, which I hope some of you have read. It's an old paper, but it was the beginning of, it spurred, an enormous amount of interesting work, both academic and in the real world. So some are classics, and some are more recent papers about more up-to-date research and what people are currently worried about. From the papers we'll be hoping to tease out what the basic problems are and what ideas people have had that might or might not be useful in solving distributed-system problems. We'll sometimes look at implementation details in these papers, because a lot of this has to do with the actual construction of software-based systems, and we're also going to spend a certain amount of time looking at evaluations: people evaluating how fault tolerant their systems are by measuring them, or measuring how much performance improvement they got, or whether they got any improvement at all.

I'm hoping that you'll read the papers before coming to class. The lectures are maybe not going to make as much sense if you haven't already read the paper, because there's not enough time to both explain all the content of the paper and have an interesting in-class reflection on what the paper means.
So you really have to read the papers before you come to class, and hopefully one of the things you'll learn in this class is how to read a paper rapidly and efficiently: to skip over the parts that maybe aren't that important and focus on teasing out the important ideas.

On the website, linked from the schedule, there's a question for every paper that you should submit an answer to. I think the answers are due at midnight. We also ask that you submit a question you have about the paper through the website, both to give me something to think about as I'm preparing the lecture and, if I have time, so that I can try to answer at least a few of the questions by email. The question and the answer for each paper are due at midnight the night before.

There are two exams: a midterm exam in class, I think on the last class meeting before spring break, and a final exam during final-exam week at the end of the semester. The exams are going to focus mostly on the papers and the labs. Probably the best way to prepare for them, besides attending lecture and reading the papers, is to look at old exams. We have links to twenty years of old exams and solutions, so you can look at those and get a feel for the kinds of questions I like to ask. Indeed, because we read many of the same papers, I inevitably ask questions each year that can't help but resemble questions asked in previous years.

The labs: there are four programming labs, and the first one is due Friday next week. Lab 1 is a simple MapReduce lab, in which you implement your own version of the paper we read for today, which I'll be discussing in a few minutes. Lab 2 involves using a technique called Raft in order to get fault tolerance: in theory it allows any system to be made fault tolerant by replicating it, with Raft managing the replication and managing automatic cut-over if one of the replicated servers fails. So that's Raft for fault tolerance. In lab 3 you'll use your Raft implementation to build a fault-tolerant key/value server; it'll be replicated and fault tolerant. In lab 4 you'll take your replicated key/value server, clone it into a number of independent groups, and split the data in your key/value storage system across all of these individual replicated groups, to get parallel speed-up by running multiple replicated groups in parallel. You'll also be responsible for moving the various chunks of data between different servers as they come and go, without dropping any balls. This is what's often called a sharded key/value service; sharding refers to partitioning the data among multiple servers in order to get parallel speed-up.
If you want, instead of doing lab 4, you can do a project of your own choice. The idea is that if you have some idea for a distributed system, in the style of the systems we talk about in class, an idea of your own that you want to pursue, and you'd like to build something and measure whether it worked in order to explore that idea, you can do a project. For a project you'll pick some teammates, because we require that projects be done in teams of two or three people. So find some teammates, send your project idea to us, and we'll think about it and say yes or no and maybe give you some advice. If we say yes and you want to do a project, you do it instead of lab 4, and it's due at the end of the semester. You should do some design work and build a real system, and on the last day of class you'll demonstrate your system, as well as handing in a short written report about what you built. I've posted on the website some ideas which may or may not be useful to spur thoughts about what you might build, but really the best projects are ones where you have a good idea of your own. If you do a project, you should choose an idea that's in the same vein as the systems we talk about in this class.

Okay, back to the labs. For lab grades, you hand in your lab code, we run some tests against it, and you're graded based on how many tests you pass. We give you all the tests that we use; there are no hidden tests. So if you implement the lab and it reliably passes all the tests, then, unless there's something funny going on (which there sometimes is), chances are good that if your code passes all the tests when you run it, it'll pass all the tests when we run it, and you'll get a full score. Hopefully there will be no mystery about what score you're likely to get on the labs. Let me warn you that debugging these labs can be time-consuming, because they're distributed systems, and with a lot of concurrency and communication, strange, difficult-to-debug errors can crop up. So you really ought to start the labs early; you'll have a lot of trouble if you leave the labs to the last moment. Start early, and if you have problems, please come to the TAs' office hours and feel free to ask questions about the labs on Piazza. Indeed, if you know the answer, I hope you'll answer other people's questions on Piazza as well.

All right, any questions about the mechanics of the course? Yes: the question is how the different components factor into the grade. I forget, but it's all on the website somewhere. The labs are the single most important component.

Okay, so this is a course about infrastructure for applications, and all through this course there's going to be a split in the way I talk about things, between applications, which somebody else, the customer, writes, and the infrastructure that those applications use, which is what we're thinking about in this course. The kinds of infrastructure that tend to come up a lot are storage, communication, and computation.
We'll talk about systems that provide all three of these kinds of infrastructure. It turns out that storage is the one we'll focus on most, because it's a very well-defined, useful, and usually fairly straightforward abstraction, so people know a lot about how to use and build storage systems, and how to build replicated, fault-tolerant, high-performance distributed implementations of storage. We'll also talk about some computation systems; MapReduce, for today, is a computation system. And we will talk about communication some, but mostly as a tool we need in order to build distributed systems: computers have to talk to each other over a network, and maybe you need reliability or something, so we'll talk a bit about that, but we're mostly consumers of communication. If you want to learn how communication systems themselves work, that's more the topic of 6.829.

For storage and computation, a lot of our goal is to discover abstractions whose use simplifies the interface to distributed storage and computation infrastructure, so that it's easy to build applications on top of it. What that really means is that we'd like to build abstractions that hide the distributed nature of these systems. The dream, which is rarely fully achieved, would be to build an interface that looks to an application as if it's a non-distributed storage system, just like a file system or something everybody already knows how to program and that has pretty simple semantics. We'd love to build interfaces that look and act just like non-distributed storage and computation systems but are actually vast, extremely high-performance, fault-tolerant distributed systems underneath. As you'll see as the course goes on, we're only part of the way there: it's rare to find an abstraction for a distributed version of storage or computation that has simple behavior, just like the non-distributed version everybody understands. But people are getting better at this, and we're going to study what people have learned about building such abstractions.

Okay, so what kinds of topics show up as we consider these abstractions? The first general topic we'll see in a lot of the systems we look at has to do with implementation. For example, one of the tools you see a lot, which people have learned to use to build these systems, is remote procedure call (RPC), whose goal is to mask the fact that we're communicating over an unreliable network. Another implementation topic we'll see a lot is threads. Threads are a programming technique that lets us harness multi-core computers, but, maybe more important for this class, threads are a way of structuring concurrent operations that hopefully simplifies the programmer's view of those concurrent operations. And because we're going to use threads a lot, it turns out we're also going to need to spend a certain amount of time, at the implementation level, thinking about concurrency control: things like locks. The main place these implementation ideas will come up in the class is the labs. They'll be touched on in many of the papers, but you're going to come face to face with all of this in a big way when you do the programming for a distributed system in the labs. Beyond ordinary programming, these are some of the critical tools you'll need to build distributed systems.
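To make the threads-plus-locks idea concrete, here's a minimal sketch in Go (the labs are written in Go) of the kind of concurrency-control pattern that shows up constantly: several threads (goroutines) updating shared state, with a mutex making the updates safe. The names and the shared counter here are made up purely for illustration; they're not from the paper or the labs.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Shared state, updated concurrently by many goroutines.
	counts := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	words := []string{"a", "b", "a", "c", "b", "a"}
	for _, w := range words {
		wg.Add(1)
		go func(word string) {
			defer wg.Done()
			// Without the lock, concurrent map writes would race
			// (and in Go, crash). The mutex serializes access.
			mu.Lock()
			counts[word]++
			mu.Unlock()
		}(w)
	}

	wg.Wait() // wait for all goroutines to finish
	fmt.Println(counts)
}
```

In a lab, the body of each goroutine would more likely be an RPC to another server, with the mutex protecting whatever local state records the replies, but the pattern is the same.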
Another big topic that comes up in all the papers we're going to talk about is performance. Usually the high-level goal of building a distributed system is to get what people call scalable speed-up, so we're looking for scalability. What I mean by scalability, or scalable speed-up, is this: if I have some problem that I'm solving with one computer, and I buy a second computer to help, and I can now solve the problem in half the time, or solve twice as many problem instances per minute on two computers as I did on one, that's scalability. Two times the computers, or resources, gets me two times the performance or throughput.

This is a huge hammer, if you can build a system that actually has this behavior: that if you increase the number of computers you throw at the problem by some factor, you get that factor more throughput, more performance, out of the system. It's a huge win, because you can buy computers with just money. The alternative is that, in order to get more performance, you have to pay programmers to restructure your software, to make it more efficient, or to apply specialized techniques or better algorithms. If you have to pay programmers to fix your code to be faster, that's an expensive way to go. We'd love to be able to just buy a thousand computers instead of ten and get a hundred times more throughput; that's fantastic. So this scalability idea is a huge idea in the back of people's heads when they're building things like big websites that run on a building full of computers: the building full of computers is there to get a corresponding amount of performance. But you have to be careful about the design in order to actually get that performance.
Often, the way this looks when I'm drawing diagrams in this course is the following. Suppose we're building a website. Ordinarily you might have a website with an HTTP server: some users, many web browsers, talk to a web server running Python or PHP or whatever, and the web server talks to some kind of database. When you have one or two users, you can just have one computer running both, or maybe one computer for the web server and one for the database. But maybe all of a sudden you get really popular and a hundred million people sign up for your service. How do you fix that? You certainly can't support millions of people on a single computer, except by extremely careful, labor-intensive optimization that you don't have time for. So typically the first thing you do to speed things up is buy more web servers and split the users: half the users, or some fraction of them, go to web server 1, and the other half go to web server 2. And because maybe you're building, I don't know, Reddit or something, where all the users need to see the same data, ultimately all the web servers talk to the same back end, and you can keep adding web servers for a long time. This is a way of getting parallel speed-up on the web-server code (if you're running PHP or Python, maybe it's not too efficient), and as long as each individual web server doesn't put too much load on the database, you can add a lot of web servers before you run into problems.

But this kind of scalability is rarely infinite, unfortunately, certainly not without serious thought. What tends to happen with these systems is that at some point, after you have ten or twenty or a hundred web servers all talking to the same database, the database starts to be the bottleneck, and adding more web servers no longer helps. So it's rare that you get full scalability out to infinite numbers of computers; at some point you run out of gas, because the place where you're adding computers is no longer the bottleneck. By adding lots and lots of web servers we've basically moved the bottleneck that limits performance from the web servers to the database. At that point you almost certainly have to do a bit of design work, because it's rare that there's any straightforward way to take the data in a single database and refactor it so it's split over multiple databases. It's often a fair amount of work, and it's awkward, but many people actually need to do it. We're going to see a lot of examples in this course where the distributed system people are describing is a storage system, because the authors were running something like a big website that ran out of gas on a single database or storage server. Anyway, the scalability story is that we'd love to build systems that scale this way, but it's hard: it takes work and design to push this idea very far.
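As an aside, the "split the users (or the data) across servers" step is usually just hashing. Here's a tiny, hypothetical sketch in Go of picking which web server, or which database shard, handles a given user; the two-server picture above corresponds to nServers = 2. The function name and setup are invented for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickServer maps a user (or a key) to one of nServers by hashing,
// so load spreads roughly evenly and requests for the same user
// consistently land on the same server.
func pickServer(userID string, nServers int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % uint32(nServers))
}

func main() {
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> web server %d\n", u, pickServer(u, 2))
	}
}
```

Lab 4's sharded key/value service applies the same idea to keys rather than users, plus the harder part: moving shards between replica groups as servers come and go.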
Okay, another big topic that comes up a lot is fault tolerance. If you're building a system with a single computer in it, well, a single computer can often stay up for years. I have servers in my office that have been up for years without crashing: the computer is pretty reliable, the operating system is reliable, and apparently the power in my building is pretty reliable, so it's not uncommon for single computers to stay up for amazing amounts of time. However, if you're building systems out of thousands of computers, then even if each computer can be expected to stay up for a year, a thousand computers means roughly a thousand failures per year, which works out to about three computer failures per day across your set of a thousand machines. So solving big problems with big distributed systems turns very rare failure problems into failure problems that happen all the time. In a system with a thousand computers there's almost certainly always something broken: some computer that's crashed, or is mysteriously running incorrectly or slowly or doing the wrong thing, or maybe some piece of the network is broken. With a thousand computers you have a lot of network cables and a lot of network switches, so there's always some cable somebody stepped on that's now unreliable, or a cable that fell out, or a switch whose fan broke so the switch overheated and failed. There's always some little problem somewhere in your building-sized distributed system.

So big scale turns failures from very rare events you really don't have to worry about much into constant problems. That means the response to failure, the masking of failures, the ability to proceed despite failures, has to be built into the design, because there are always failures. And as part of building convenient abstractions for application programmers, we really need to be able to build infrastructure that, as much as possible, hides or masks the failures from application programmers, so that every application programmer doesn't have to have a complete, complicated story for all the different kinds of failures that can occur.

There are a bunch of different notions of what it means to be fault tolerant, and we'll see a lot of different flavors, but here are some of the more common ideas. One is availability. Some systems are designed so that under certain kinds of failures (not all failures, but certain kinds) the system will keep operating despite the failure, while providing undamaged service, the same kind of service it would have provided had there been no failure. Some systems are available in that sense. For example, if you build a replicated service with two copies and one of the replica servers fails, maybe the other server can continue operating. If they both fail, of course, you can't promise availability. So available systems usually say: under a certain set of failures we're going to continue providing service, and if more failures than that occur, the system won't be available anymore.
Another kind of fault tolerance you might have, in addition to availability or by itself, is recoverability. What this means is that if something goes wrong, maybe the service will stop working: it'll simply stop responding to requests and wait for someone to come along and repair whatever went wrong, but after the repair the system will be able to continue as if nothing bad had happened. This is a weaker requirement than availability, because here we're not going to do anything until the failed component has been repaired. But the fact that we can get going again without any loss of correctness is still a significant requirement: it means recoverable systems typically need to do things like save their latest state on disk, somewhere they can get it back after the power comes back up. And even among available systems, in order for a system to be useful in real life, the way available systems are usually specified is that they're available until some number of failures have happened; if too many failures happen, an available system will stop working, or stop responding at all, but when enough things have been repaired it will continue operating. So a good available system will be recoverable as well, in the sense that if too many failures occur it will stop answering, but will then continue correctly after repair. That's what we'd love to obtain.

We'll see a number of approaches to solving these problems, but there are really two tools that are the most important ones we have in this department. One is non-volatile storage: if something crashes, or the power fails, maybe there's a building-wide power failure, we can use non-volatile storage, like hard drives or flash or solid-state drives, to store a checkpoint or a log of the state of the system, and then when the power comes back up, or somebody repairs our power supply or whatever, we'll be able to read our latest state off the disk and continue from there. The management of non-volatile storage comes up a lot, because non-volatile storage tends to be expensive to update, and a huge amount of the nitty-gritty of building high-performance fault-tolerant systems is in clever ways to avoid having to write non-volatile storage too often. In the old days, and even today, writing non-volatile storage meant moving a disk arm and waiting for a disk platter to rotate, both of which are agonizingly slow on the scale of three-gigahertz microprocessors. With things like flash, life is quite a bit better, but it still requires a lot of thought to get good performance out of.
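To make "store a log of the state and read it back after a failure" concrete, here's a minimal, hypothetical sketch in Go of the basic pattern: append each update to a log file and force it to disk before acknowledging, then on restart replay the log to rebuild the in-memory state. Real systems add checkpoints, record framing, and careful error handling; the file name and function names below are invented, and this only shows the core idea.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// apply performs one logged update on the in-memory state.
func apply(state map[string]string, key, value string) {
	state[key] = value
}

// logUpdate appends one update to the log and forces it to disk
// (fsync) before returning, so the update survives a crash.
func logUpdate(logFile *os.File, key, value string) error {
	if _, err := fmt.Fprintf(logFile, "%s %s\n", key, value); err != nil {
		return err
	}
	return logFile.Sync()
}

// recoverState rebuilds the state by replaying every record in the log.
func recoverState(path string) (map[string]string, error) {
	state := make(map[string]string)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return state, nil // nothing logged yet
	} else if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), " ", 2)
		if len(parts) == 2 {
			apply(state, parts[0], parts[1])
		}
	}
	return state, sc.Err()
}

func main() {
	const path = "updates.log"
	state, _ := recoverState(path) // after a restart, replay the log
	logFile, _ := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	defer logFile.Close()

	apply(state, "x", "1")
	logUpdate(logFile, "x", "1") // durable before we reply to the client
	fmt.Println(state)
}
```

The "clever ways to avoid writing non-volatile storage too much" mentioned above mostly amount to batching many updates per Sync and periodically replacing a long log with a compact checkpoint.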
The other big tool we have for fault tolerance is replication, and the management of replicated copies is tricky. The key problem lurking in any replicated system, where we have two servers each with a supposedly identical copy of the system state, is always that the two replicas will accidentally drift out of sync and stop being replicas. That problem is in the back of every design we're going to see for using replication to get fault tolerance, and labs 2 and 3 are all about the management of replicated copies for fault tolerance. As you'll see, it's pretty complex.

A final cross-cutting topic is consistency. Here's an example of what I mean by consistency. Suppose we're building a distributed storage system, and it's a key/value service, so it supports just two operations. There's a put operation: you give it a key and a value, and the storage system stashes away the value as the value for that key; it maintains just a big table of keys and values. And there's a get operation: the client sends it a key, and the storage service is supposed to respond with the value it has stored for that key. I use this as an example because a key/value service is about the simplest distributed system I can think of, and lots of real distributed systems are key/value services. They're very useful: a kind of fundamental, simple version of a storage system.

Of course, if you're an application programmer, it's helpful if these two operations have meanings attached to them, so that you can go look in the manual, and the manual says what it means, and what you'll get back, when you call get, and what it means when you call put. There needs to be some sort of spec for what they mean; otherwise, who knows, how could you possibly write an application without a description of what put and get are supposed to do? This is the topic of consistency, and the reason it's interesting in distributed systems is that, both for performance reasons and for fault-tolerance reasons, we often have more than one copy of the data floating around. In a non-distributed system, where you just have a single server with a single table, there's often (although not always) relatively little ambiguity about what put and get could possibly mean: intuitively, put means update the table, and get means return the version that's stored in the table. But in a distributed system where there's more than one copy of the data, due to replication or caching or who knows what, there may be lots of different versions of a key/value pair floating around.

For example, suppose there are two copies of the server, so they both have a key/value table, and maybe key 1 has the value 20 on both of them.
Then some client comes along, we have a client over here, and it sends a put: it wants to update the value of key 1 to be 21 (maybe it's counting something in this key/value store). So it sends a put with key 1 and value 21 to the first server, and it's about to send the same put to the second server, because it wants to update both copies and keep them in sync. But just before it sends the put to the second server, it crashes: a power failure, or a bug in the operating system, or something. So the state we're sadly left in is that we've updated one of the two replicas to have value 21, but the other one still has 20. Now somebody comes along and reads key 1 with a get, and they might get 21 or they might get 20, depending on which server they talk to. Even if the rule is that you always talk to the top server first, in a fault-tolerant system the actual rule has to be: talk to the top server first unless it has failed, in which case talk to the bottom server. So either way, someday you risk exposing this stale copy of the data to some future get. It could be that many gets see the updated 21, and then next week, all of a sudden, some get yields a week-old copy of the data. That's not very consistent, but it's the kind of thing that can happen if we're not careful. So we need to actually write down what the rules are going to be for puts and gets, given this danger due to replication.

It turns out there are many different definitions of consistency. Many of them are relatively straightforward; many of them sound like "a get yields the value written by the most recently completed put". That's usually called strong consistency. It also turns out to be very useful to build systems with much weaker consistency, which, for example, do not guarantee anything like "a get sees the value written by the most recent put". So there are strongly consistent systems, which usually have some version of "gets see the most recent put", although there are a lot of details to work out; and there are many flavors of weakly consistent systems that make no such guarantee. They may guarantee only that if someone does a put, you might not see it, and you might keep seeing old values that weren't updated by the put, perhaps for an unbounded amount of time. The reason people are very interested in weak consistency schemes is that strong consistency, that is, reads actually being guaranteed to see the most recent write, is a very expensive spec to implement, because it almost certainly means somebody has to do a lot of communication. If you have multiple copies, it means that either the writer or the reader, or maybe both, has to consult every copy.
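Here's a small, hypothetical sketch in Go of the scenario above: two replicas, a client that updates them one at a time (and can crash in between), and two kinds of read, a cheap one that asks a single replica and can return stale data, and an expensive one that asks every replica and takes the most recently written value (using a version number as a stand-in for "most recent"). All of the names and the version-number scheme are made up for illustration; none of this is from the paper.

```go
package main

import "fmt"

// One replica's table: key -> (value, version).
type entry struct {
	value   string
	version int
}
type replica map[string]entry

// put writes to the replicas one at a time; if the client crashes
// partway through (crashAfterFirst), the replicas diverge.
func put(replicas []replica, key, value string, version int, crashAfterFirst bool) {
	for i, r := range replicas {
		r[key] = entry{value, version}
		if crashAfterFirst && i == 0 {
			return // simulated client crash between the two sends
		}
	}
}

// weakGet asks a single replica: cheap, but may return a stale value.
func weakGet(r replica, key string) string {
	return r[key].value
}

// strongishGet asks every replica and returns the highest-versioned
// value: more communication, but it sees the latest completed write.
func strongishGet(replicas []replica, key string) string {
	best := entry{}
	for _, r := range replicas {
		if e := r[key]; e.version >= best.version {
			best = e
		}
	}
	return best.value
}

func main() {
	r1, r2 := replica{}, replica{}
	replicas := []replica{r1, r2}

	put(replicas, "1", "20", 1, false) // both copies hold 20
	put(replicas, "1", "21", 2, true)  // crash: only r1 gets 21

	fmt.Println(weakGet(r2, "1"))            // "20": a stale read
	fmt.Println(strongishGet(replicas, "1")) // "21": reads all copies
}
```

The real difficulty, which this sketch ignores entirely, is doing something like strongishGet when replicas themselves crash and messages are lost; that is what Raft and labs 2 and 3 are about.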
Like in this case, where a client crash left one copy updated but not the other: if we wanted to implement strong consistency in maybe the simplest way in this system, we'd have readers read both of the copies, or all the copies if there are more than two, and use the most recently written value they find. But that's expensive; that's a lot of chitchat just to read one value. So, in order to avoid communication as much as possible, particularly if the replicas are far away, people build weak systems that might actually allow a stale read of an old value in a case like this, although there are often more semantics attached to try to make these weak schemes more useful.

Where this communication problem, strong consistency requiring expensive communication, really runs you into trouble is this: if we're using replication for fault tolerance, then we really want the replicas to have independent, uncorrelated failure probabilities. For example, putting both replicas of our data in the same rack in the same machine room is probably a really bad idea, because if someone trips over the power cable to that rack, both copies of our data die, since they're both attached to the same power cable in the same rack. So in the search for making replicas fail as independently as possible, in order to get decent fault tolerance, people would love to put different replicas as far apart as possible, in different cities, or maybe on opposite sides of the continent, so that an earthquake that destroys one data center is extremely unlikely to also destroy the data center that has the other copy. We'd love to be able to do that. But if you do, then the other copy is thousands of miles away, and the rate at which light travels means it may take on the order of milliseconds, or tens of milliseconds, to communicate with a data center across the continent in order to update the other copy of the data. That makes the communication required for strong consistency potentially extremely expensive: every time you want to do one of these puts, or maybe a get, depending on how you implement it, you might have to sit there waiting ten or twenty or thirty milliseconds in order to talk to both copies of the data, to ensure they're both updated or to check both to find the latest copy. That's a tremendous expense: ten or twenty or thirty milliseconds on machines that, after all, execute something like a billion instructions per second, so we're wasting a lot of potential instructions while we wait. So people often go with much weaker systems, where you're allowed to update only the nearest copy, or consult only the nearest copy, and there's a huge amount of academic and real-world research on how to structure weak consistency guarantees so they're actually useful to applications, and how to take advantage of them to actually get high performance.

All right, so that's a lightning preview of the technical ideas in the course. Any questions about this before I start talking about MapReduce?
All right, I want to switch to MapReduce. It's a detailed case study that will illustrate most of the ideas we've been talking about. MapReduce is a system that was originally designed, built, and used by Google; I think the paper dates back to 2004. The problem they faced was that they were running huge computations on terabytes and terabytes of data, like creating an index of all the content of the web, or analyzing the link structure of the entire web in order to identify the most important or most authoritative pages. The whole web was, even in those days, tens of terabytes of data. Building an index of the web is basically equivalent to running a sort over the entire data set, which is reasonably expensive; to run a sort over the entire content of the web on a single computer, who knows how long it would have taken, but weeks or months or years or something. So Google at the time was desperate to be able to run giant computations on giant data sets, on thousands of computers, so that the computations could finish rapidly. It was worth it to them to buy lots of computers so that their engineers wouldn't have to spend a lot of time reading the newspaper or something while waiting for their big compute jobs to finish.

For a while they had their clever engineers hand-write this stuff. If you needed to write a web indexer or some sort of link-analysis tool, Google bought the computers and said: here, engineers, write whatever software you like on these computers. And the engineers would laboriously write one-off, manually written software to take whatever problem they were working on, somehow farm it out to a lot of computers, organize that computation, and get the data back. If you only hire engineers who are skilled distributed-systems experts, maybe that's okay, although even then it's probably very wasteful of engineering effort. But they wanted to hire people who were skilled at something else, and not necessarily engineers who wanted to spend all their time writing distributed-systems software. So they really needed some kind of framework that would make it easy for their engineers to write just the guts of whatever analysis they wanted to do, the sort algorithm, or the web indexer, or the link analyzer, or whatever, and be able to run it on thousands of computers without worrying about the details of how to spread the work over those computers, how to organize whatever data movement was required, or how to cope with the inevitable failures. They were looking for a framework that would make it easy for non-specialists to write and run giant distributed computations. That's what MapReduce is all about. The idea is that the programmer, the application designer, the consumer of this distributed computation, just writes a simple map function and a simple reduce function that don't know anything about distribution, and the MapReduce framework takes care of everything else.
Here's an abstract view of what MapReduce is up to. It starts by assuming that there's some input, and the input is split up into a whole bunch of different files or chunks in some way: input file 1, input file 2, and so on. These inputs are maybe web pages crawled from the web, or more likely big files, each of which contains many web pages crawled from the web. The way MapReduce starts is that you define a map function, and the MapReduce framework is going to run your map function on each of the input files. You can see there's some obvious parallelism available here: it can run the maps in parallel. Each of these map invocations looks only at its own input and produces output, and the output that a map function is required to produce is a list of key/value pairs. It takes a file, some fraction of the input data, as input, and it produces a list of key/value pairs as output.

For example, suppose we're writing the simplest possible MapReduce application: word count. The goal of a word-count MapReduce job is to count the number of occurrences of each word, so your map function might emit key/value pairs where the key is the word and the value is just 1. The map function splits its input up into words, and for every word it sees it emits that word as the key and 1 as the value; later on we'll count up all those 1s to get the final output. So maybe input 1 has the word a in it and the word b in it, and the output this map produces is key a value 1, key b value 1. Maybe the second map invocation sees a file that has a b in it and nothing else, so it's going to emit b 1. Maybe the third input has an a in it and a c in it. So we run all these maps on all the input files, and we get what the paper calls intermediate output: for every map, a set of key/value pairs.

The second stage of the computation is to run the reduces, and the idea is that the MapReduce framework collects together, from all the maps, all instances of each key. So the framework is going to collect together all of the a's, every key/value pair from every map whose key was a, and hand them all to one call of the programmer-defined reduce function. Then it's going to take all the b's and collect them together; of course this requires a real collection, real data movement, because the different instances of key b were produced by different invocations of map on different computers. So we're going to collect all the b keys and hand them to a different call to reduce that has all of the b's as its argument, and the same for c. The MapReduce framework will arrange for one call to reduce for every key that occurred in any of the map output.
For our silly word-count example, all any one of these reduces has to do is count the number of items passed to it. It doesn't even have to look at the items, because it knows that each of them is the word it's responsible for, plus 1 as the value; it doesn't have to look at those 1s, it can just count. So this reduce is going to produce a and then the count of its inputs, 2, and that reduce is going to produce the key it's associated with, b, and the count of its values, which is also 2. That's what a typical MapReduce job looks like at a high level.

Just for completeness, a little terminology: the whole computation is called a job, and any one invocation of map or reduce is called a task. So we have the entire job, and it's made up of a bunch of map tasks and then a bunch of reduce tasks.

Here's an example, for word count, of what the map and reduce functions would look like. The map function takes a key and a value as arguments, and now we're talking about functions written in an ordinary programming language, like C++ or Java or who knows what; this is just code that ordinary people can write. What a map function for word count does: the key is the file name, which is typically ignored, since we don't really care what the file name was, and v is the content of this map's input file, so v just contains all the text. We're going to split v into words, and then, for each word, we're going to call emit. Emit takes two arguments and is provided by the MapReduce framework: we hand emit a key, which is the word, and a value, which is the string "1". That's it for a word-count map function, and in MapReduce it can literally be this simple; that's sort of the promise of MapReduce. This map function doesn't know anything about distribution, or multiple computers, or the fact that we need to move data across the network, or who knows what. It's extremely straightforward.

And the reduce function for word count: remember, each reduce is called with all the instances of a given key, so the MapReduce framework calls reduce with the key it's responsible for and a vector of all the values that the maps produced associated with that key. Here the key is the word, and the values are all 1s; we don't care about them, we only care how many there were. Reduce has its own emit function, which takes just one argument, a value to be emitted as the final output, as the value for this key. So we're going to emit the length of this array. This is also about as simple as reduce functions get in MapReduce: extremely simple, and requiring no knowledge about fault tolerance or anything else.
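The paper sketches these two functions in C++-style pseudocode. Here is a minimal version in Go (the labs are in Go), together with a tiny sequential driver that plays the role of the framework: splitting the input, calling map, grouping intermediate pairs by key, and calling reduce. The type and function names are my own, for illustration; lab 1 defines its own interfaces.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// KeyValue is one intermediate pair emitted by a map function.
type KeyValue struct {
	Key   string
	Value string
}

// mapF: key is the input file name (ignored here), value is its contents.
// For word count it emits (word, "1") for every word it sees.
func mapF(filename, contents string) []KeyValue {
	var kvs []KeyValue
	for _, w := range strings.Fields(contents) {
		kvs = append(kvs, KeyValue{w, "1"})
	}
	return kvs
}

// reduceF is called once per key with all the values the maps emitted
// for that key; for word count the answer is just how many there were.
func reduceF(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	inputs := map[string]string{ // file name -> contents
		"in1": "a b",
		"in2": "b",
		"in3": "a c",
	}

	// "Map phase": run mapF on every input, collect intermediate pairs.
	intermediate := make(map[string][]string)
	for name, contents := range inputs {
		for _, kv := range mapF(name, contents) {
			intermediate[kv.Key] = append(intermediate[kv.Key], kv.Value)
		}
	}

	// "Reduce phase": one call to reduceF per distinct key.
	for key, values := range intermediate {
		fmt.Println(key, reduceF(key, values)) // a 2, b 2, c 1
	}
}
```

The interesting part of the real system, of course, is everything this little driver hides: running the maps and reduces on different machines and moving the intermediate data between them, which is what the rest of the lecture is about.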
All right, any questions about the basic framework?

Question: can you feed the output of the reducers into another MapReduce? Oh yes, in real life it is routine among MapReduce users to define a MapReduce job that takes some inputs and produces some outputs, and then have a second MapReduce job consume them, because you're doing some complicated multi-stage analysis, or an iterative algorithm like PageRank, for example, which is the algorithm Google uses to estimate how important or influential different web pages are. That's an iterative algorithm that gradually converges on an answer, and if you implement it in MapReduce, which I think they originally did, you have to run the MapReduce job multiple times, and the output of each run is a list of web pages with an updated weight or importance for each page. So it was routine to take that output and use it as the input to another MapReduce job.

Follow-up: yes, you need to set things up so the output is in the right form; you write the reduce function in the knowledge that it needs to produce data in the format, or with the information, required by the next MapReduce job. This actually brings up a bit of a shortcoming in the MapReduce framework. It's great if the algorithm you need to run is easily expressible as a map, followed by this shuffling of the data by key, followed by a reduce, and that's it; MapReduce is fantastic for algorithms that can be cast in that form. Furthermore, each of the maps has to be completely independent: they're required to be pure functions that just look at their arguments and nothing else. That's a restriction, and it turns out many people want to run much longer pipelines involving lots and lots of different kinds of processing. With MapReduce you have to cobble that together from multiple distinct MapReduce jobs. More advanced systems, which we'll talk about later in the course, are much better at allowing you to specify the complete pipeline of computation, and then the framework sees all the work you have to do and can organize and efficiently optimize much more complicated computations.

Another question: from the programmer's point of view, it's just about map and reduce. From our point of view, it's going to be about the worker processes and the worker servers that are part of the MapReduce framework and that, among many other things, call the map and reduce functions. So yes, from our point of view we care a lot about how this is organized by the surrounding framework; what I've drawn is the programmer's view, with all the distributed stuff stripped out.

Another question: sorry, say that again. Oh, you mean where does the intermediate data go? Okay, so there are two questions: one is what happens to the data when you call emit, and the other is where the functions run. The actual answer, starting with where this stuff runs: there are a number of, say, a thousand servers. Actually, the right thing to look at here is Figure 1 in the paper. Sitting underneath this, in the real world, there's some big collection of servers, and we'll call them worker servers, or workers, and there's also a single master server.
I mean this 60:07 actually brings up a little bit of a 60:09 shortcoming in the MapReduce framework 60:11 which is that it's great if the 60:16 algorithm you need to run is easily 60:18 expressible as a map followed by this 60:20 sort of shuffling of the data by key 60:23 followed by a reduce and that's it 60:26 MapReduce is fantastic for algorithms 60:28 that can be cast in that form where 60:30 furthermore each of the maps has to be 60:32 completely independent and 60:33 they're required to be pure 60:39 functional functions that just look at 60:42 their arguments and nothing else 60:44 you know that's a restriction 60:46 and it turns out that many people want 60:48 to run much longer pipelines that 60:49 involve lots and lots of different kinds 60:51 of processing and with MapReduce you 60:53 have to sort of cobble that together 60:54 from multiple distinct 60:58 MapReduce jobs and more advanced systems 61:00 which we will talk about later in the 61:02 course are much better at allowing you 61:04 to specify the complete pipeline of 61:06 computations and they'll do optimization 61:08 you know the framework realizes all the 61:10 stuff you have to do and can organize and 61:13 efficiently optimize 61:15 much more complicated computations 61:39 from the programmer's point of view it's 61:41 just about map and reduce from our point 61:44 of view it's going to be about the 61:45 worker processes and the worker servers 61:49 that are part of the MapReduce 61:53 framework and that among many other things 61:55 call the map and reduce functions so 62:00 yeah from our point of view we care a 62:01 lot about how this is organized by the 62:04 surrounding framework this is sort of 62:06 the programmer's view with all the 62:08 distributed stuff stripped out yes 62:15 sorry you gotta say it again oh you mean 62:25 where does the intermediate data go okay so 62:32 there's two questions one is when you 62:35 call emit what happens to the data and 62:38 the other is where the functions run so 62:46 the actual answer is that first where 62:50 the stuff runs there's a number of say 62:53 a thousand servers um actually the right 62:56 thing to look at here is figure one in 62:58 the paper sitting underneath this in the 63:02 real world there's some big collection 63:04 of servers and we'll call them maybe 63:09 worker servers or workers and there's 63:12 also a single master server that's 63:14 organizing the whole computation and 63:16 what's going on here is the master 63:18 server you know knows that there's some 63:22 number of input files you know five 63:24 thousand input files and it farms out 63:27 invocations of map to the different 63:29 workers so it'll send a message to 63:30 worker seven saying please run you know 63:34 this map function on such-and-such an 63:37 input file and then the worker process 63:41 which is you know part of MapReduce and 63:43 knows all about MapReduce will then 63:47 read the input 63:50 whichever input file and call this map 63:54 function with the file name and value as its 63:56 arguments that worker process also 64:00 implements emit and 64:02 every time the map calls emit the worker 64:05 process will write this data to files on 64:10 the local disk so what happens to map 64:12 emits is that they produce files on the 64:17 map worker's local disk that 64:19 accumulate all the keys and values 64:21 produced by the maps run on that worker 64:26 so at the end of the map phase what 64:30 we're left with is all those worker 64:32 machines each of which has the output of 64:35 whatever maps were run on that 64:37 worker machine then the MapReduce 64:42 workers arrange to move the data to 64:45 where it's going to be needed for the 64:46 reduces so you know in a 64:50 typical big computation 64:53 this reduce invocation is going to need 64:55 all the map output that 64:59 mentioned the key a and it's gonna turn 65:01 out you know this is a simple example 65:04 but in general every single map 65:08 invocation will have produced lots of 65:10 keys including some instances of key a 65:12 so typically before we can even 65:15 run this reduce function the MapReduce 65:17 framework that is the MapReduce worker 65:20 running on one of our thousand servers 65:22 is going to have to go talk to every 65:24 single other one of the thousand servers and 65:26 say look you know I'm gonna run the 65:28 reduce for key a please look at the 65:31 intermediate map output stored on your 65:33 disk and fish out all of the instances 65:35 of key a and send them over the network 65:38 to me so the reduce worker is going to 65:41 do that it's going to fetch from every 65:43 worker all of the instances of the key 65:45 that it's responsible for the key the 65:47 master has told it to be responsible for 65:50 and once it's collected all of that data 65:51 then it can call reduce and the reduce 65:55 function itself calls reduce's emit which 65:58 is different from the map's emit and what 66:01 reduce's emit does is write the output 66:04 to a file in a cluster file service that 66:12 Google uses
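Here is a sketch of the map-worker side of that, in Go, under the assumption (as in both the paper and the lab) that intermediate pairs are partitioned into one local file per reduce task by hashing the key; the doMapTask name and the mr-MAPID-REDUCEID file naming are illustrative choices, not something the paper specifies.

    // Sketch of what a map worker might do with one map task: read the
    // input file, call the application's Map, and bucket the emitted pairs
    // into one intermediate file on the local disk per reduce task.
    package main

    import (
        "encoding/json"
        "fmt"
        "hash/fnv"
        "os"
    )

    type KeyValue struct {
        Key   string
        Value string
    }

    // partition decides which reduce task is responsible for a key.
    func partition(key string, nReduce int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()&0x7fffffff) % nReduce
    }

    func doMapTask(mapID int, filename string, nReduce int,
        mapf func(string, string) []KeyValue) error {
        contents, err := os.ReadFile(filename)
        if err != nil {
            return err
        }
        kva := mapf(filename, string(contents))

        // one intermediate file on local disk per reduce task,
        // e.g. mr-3-7 holds map task 3's output destined for reduce task 7
        encoders := make([]*json.Encoder, nReduce)
        for r := 0; r < nReduce; r++ {
            f, err := os.Create(fmt.Sprintf("mr-%d-%d", mapID, r))
            if err != nil {
                return err
            }
            defer f.Close()
            encoders[r] = json.NewEncoder(f)
        }
        for _, kv := range kva {
            if err := encoders[partition(kv.Key, nReduce)].Encode(&kv); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        // trivial map function just to exercise doMapTask
        mapf := func(filename, contents string) []KeyValue {
            return []KeyValue{{Key: filename, Value: contents}}
        }
        if err := doMapTask(0, "input.txt", 4, mapf); err != nil {
            fmt.Println(err)
        }
    }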
so here's something I 66:14 haven't mentioned I haven't mentioned 66:17 where the input lives and where the 66:21 output lives they're both files and because 66:25 we want the 66:28 flexibility to be able to read any piece 66:31 of input on any worker server that means 66:34 we need some kind of network file system 66:36 to store the input data and so indeed 66:42 the paper talks about this thing called 66:44 GFS or Google File System and GFS is a 66:50 cluster file system and GFS actually 66:51 runs on exactly the same set of 66:54 worker servers that run MapReduce 66:56 and GFS just automatically 67:00 you know it's a file system you 67:02 can read and write files and it just 67:03 automatically splits up any big file you 67:06 store on it across lots of servers in 67:08 64 megabyte chunks so if you 67:12 have ten terabytes of crawled 67:14 web page contents and you just write 67:17 them to GFS even as a single big file 67:20 GFS will automatically split that vast 67:23 amount of data up into 64 megabyte 67:25 chunks distributed evenly over all of 67:28 the GFS servers which is to say all the 67:30 servers that Google has available and 67:32 that's fantastic that's just what we 67:34 need if we then want to run a MapReduce 67:36 job that takes the entire crawled web as 67:39 input the data is already stored in a 67:42 way that's split up evenly across all the 67:44 servers and so that means that the map 67:47 workers you know 67:49 if we have a thousand servers we're 67:51 gonna launch a thousand map workers each 67:53 reading one one-thousandth of the input data and 67:55 they're going to be able to read the 67:57 data in parallel from a thousand GFS 68:01 file servers thus getting tremendous 68:04 total read throughput you know the read 68:07 throughput of a thousand servers
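As a rough back-of-the-envelope check of that, using the 10 terabyte and 64 megabyte figures from the lecture and decimal units:

    \frac{10\ \mathrm{TB}}{64\ \mathrm{MB}} \;=\; \frac{10\times 10^{12}\ \mathrm{B}}{64\times 10^{6}\ \mathrm{B}} \;\approx\; 1.6\times 10^{5}\ \text{chunks}

so spread over a thousand GFS servers, each server ends up holding on the order of 160 chunks, roughly 10 GB of the input.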
68:20 so are you thinking maybe that Google 68:23 has one set of physical machines running 68:25 GFS and a separate set of physical 68:27 machines that run MapReduce jobs okay 68:40 right so the question is what does this 68:44 arrow here actually involve and the 68:48 answer is that it actually sort of changed 68:50 over the years as Google 68:51 evolved this system but you know in 68:55 the most general case if we have 68:58 big files stored in some big network 69:01 file system you know GFS 69:02 is a bit like AFS which you might have used on 69:05 Athena where your data is split over a big 69:09 collection of servers and you have to go talk 69:11 to those servers over the network to 69:12 retrieve your data in that case what 69:14 this arrow might represent is that the 69:17 MapReduce worker process has to go off 69:20 and talk across the network to the 69:22 correct GFS server or maybe servers that 69:25 store its part of the input and fetch 69:28 it over the network to the MapReduce 69:30 worker machine in order to pass it to the map 69:33 and that's certainly the most general 69:35 case and that was eventually how 69:37 MapReduce actually worked in the world 69:40 of this paper though if you did 69:44 that that's a lot of network 69:45 communication we're talking about ten 69:47 terabytes of data and we'd have to move ten 69:49 terabytes across their data center 69:51 network and you know data center 69:54 networks run at gigabits per second but 69:55 it's still a lot of time to move tens of 69:57 terabytes of data and 70:02 indeed in the world of this paper in 70:04 2004 the most constraining bottleneck in 70:07 their MapReduce system was network 70:08 throughput because of the network they were running on 70:11 if you read as far as 70:13 the evaluation section 70:18 their network was like this they had thousands 70:24 of machines 70:27 and they would plug the machines in 70:32 each rack of machines had you know an 70:35 Ethernet switch for that rack or 70:36 something but then you know they all 70:38 need to talk to each other so there was 70:40 a root Ethernet switch that all of the 70:43 racks' Ethernet switches talked to and 70:45 you know so if you just 70:47 pick some MapReduce worker and some GFS 70:51 server you know chances are at least 70:52 half the time the communication between 70:54 them has to pass through this one 70:56 root switch which had 70:58 only some amount of total throughput 71:01 you know some number of 71:05 gigabits per second and I forget the 71:09 number but when I did the division 71:13 that is divided the total 71:17 throughput available in the root switch 71:19 by the roughly 2000 servers that they 71:21 used in the paper's experiments what I 71:23 got was that each machine's share of the 71:26 root switch of the total network 71:27 capacity was only 50 megabits per second 71:30 in their setup 50 megabits 71:36 per second per machine and that might 71:41 seem like a lot 50 megabits gosh 71:43 millions and millions but it's actually 71:45 quite small compared to how fast disks 71:47 run or CPUs run and so with their 71:51 network this 50 megabits per second was 71:53 like a tremendous limit
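Running the same division the other way, and assuming the input is spread perfectly evenly, which is an idealization:

    50\ \mathrm{Mbit/s} \times 2000\ \text{machines} \;\approx\; 100\ \mathrm{Gbit/s}\ \text{(implied root-switch capacity)}
    \frac{10\ \mathrm{TB}/2000\ \text{machines}}{50\ \mathrm{Mbit/s}} \;=\; \frac{5\ \mathrm{GB}}{6.25\ \mathrm{MB/s}} \;=\; 800\ \mathrm{s} \;\approx\; 13\ \text{minutes}

that is, roughly 13 minutes per machine just to pull its share of a 10 TB input through the root switch before doing any computation, whereas a local disk of that era could read the same 5 GB in a minute or two; the 100 Gbit/s figure is only what the quoted 50 Mbit/s per machine implies, not a number taken from the paper.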
and so they 71:56 really stood on their heads in the 71:57 design described in the paper to avoid 72:00 using the network and they played a 72:02 bunch of tricks to avoid sending stuff 72:05 over the network whenever they possibly 72:07 could one of them was that they 72:10 ran the GFS servers and the 72:14 MapReduce workers on the same set of 72:16 machines so if they have a thousand 72:19 machines they implement 72:23 their GFS service on those thousand 72:25 machines and run MapReduce on the same 72:27 thousand machines and then when the 72:29 master was splitting up the map work and 72:33 sort of farming it out to different 72:34 workers it would cleverly when it was 72:39 about to run the map that was going to 72:41 read from input file one figure 72:44 out from GFS which server actually holds 72:47 input file one on its local disk and it 72:50 would send the map for that input file 72:53 to the MapReduce software on the same 72:55 machine so that by default this arrow 72:59 was actually a local read from the 73:01 local disk and did not involve the 73:03 network and you know depending on 73:05 failures or load or whatever it 73:07 couldn't always do that but almost all 73:10 the maps would be run on the very same 73:11 machine that stored the data thus saving 73:13 them a vast amount of time that they would 73:17 otherwise have had to wait to move the input 73:19 data across the network the next trick 73:22 they played is that map as I mentioned 73:26 before stores its output on the local 73:28 disk of the machine that you run the map 73:29 on so again storing the output of the 73:31 map does not require network 73:33 communication at least not immediately 73:35 because the output is stored on the local disk 73:38 however we know for sure that one way or 73:42 another by the way MapReduce is 73:46 defined in order to group together all 73:49 of the values associated with a given 73:51 key and pass them to a single invocation 73:55 of reduce on some machine this is going 73:57 to require network communication 73:59 you know we need to fetch 74:02 all the values and give them to a single 74:03 machine so they have to be moved across the 74:05 network and so this shuffle this 74:08 movement of the keys which are 74:11 originally stored by row on the same 74:14 machine that ran the map we need them 74:16 essentially to be stored by column on 74:18 the machine that's going to be 74:19 responsible for the reduce this 74:22 transformation of row storage to 74:23 essentially column storage is what the 74:25 paper calls a shuffle and it really 74:28 required moving every piece of data 74:30 across the network from the map that 74:33 produced it to the reduce that would 74:34 need it and it's like the expensive 74:36 part of MapReduce
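A sketch of the reduce-worker side of that shuffle, in Go: once a reduce worker has fetched its partition of the intermediate pairs from every map worker's disk, it sorts them by key and calls the application's reduce once per distinct key; the fetching itself and the final write to the file system are left out, and runReduce is just an illustrative name.

    // Group fetched intermediate pairs by key and call reduce once per key.
    package main

    import (
        "fmt"
        "sort"
        "strconv"
    )

    type KeyValue struct {
        Key   string
        Value string
    }

    func runReduce(intermediate []KeyValue,
        reducef func(string, []string) string) map[string]string {
        // sort by key so that all values for a given key are adjacent
        sort.Slice(intermediate, func(i, j int) bool {
            return intermediate[i].Key < intermediate[j].Key
        })
        out := map[string]string{}
        i := 0
        for i < len(intermediate) {
            j := i
            values := []string{}
            for j < len(intermediate) && intermediate[j].Key == intermediate[i].Key {
                values = append(values, intermediate[j].Value)
                j++
            }
            // one call to the application's reduce per distinct key;
            // in the real system its output would go to a file on GFS
            out[intermediate[i].Key] = reducef(intermediate[i].Key, values)
            i = j
        }
        return out
    }

    func main() {
        intermediate := []KeyValue{{"a", "1"}, {"b", "1"}, {"a", "1"}}
        counts := runReduce(intermediate, func(key string, values []string) string {
            return strconv.Itoa(len(values))
        })
        fmt.Println(counts) // map[a:2 b:1]
    }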
yeah 74:51 you're right you can imagine a different 74:53 definition in which you have a more kind 74:55 of streaming reduce I don't know I 74:57 haven't thought this through I don't 75:00 know whether that would be feasible 75:02 or not certainly as far as the programmer 75:04 interface goes their 75:06 number-one goal really was to 75:09 make it easy to program for people who 75:11 just had no idea of what was going on in 75:13 the system so it may be that you know 75:16 this spec this is really the way reduce 75:18 functions look you know in C++ or 75:22 something and a streaming version of 75:24 this I don't 75:28 know how it would look probably not this 75:30 simple but you know maybe it could be 75:33 done that way and indeed many modern 75:35 systems people got a lot more 75:37 sophisticated with modern things that 75:41 are the successors to MapReduce and 75:43 they do indeed involve processing 75:45 streams of data often rather than this 75:48 very batch approach this is a batch 75:50 approach in the sense that we wait until 75:52 we get all the data and then we process 75:54 it so first of all you then have to 75:57 have a notion of finite inputs right 75:59 modern systems often do indeed use 76:02 streams and are able to take 76:05 advantage of some efficiencies over 76:08 MapReduce by doing that okay so this is the point at 76:15 which this shuffle is where all the 76:17 network traffic happens and this can 76:19 actually be a vast amount of data if 76:21 you think about sorting 76:23 the output of the sort has the same 76:26 size as the input to the sort so that 76:29 means that if your 76:30 input is 10 terabytes of data and you're 76:32 running a sort you're moving 10 76:34 terabytes of data across the network at 76:36 this point and your output will also be 76:38 10 terabytes so this is quite a lot 76:40 of data and indeed it is for many 76:42 MapReduce jobs although not all there are 76:44 some that significantly reduce the 76:46 amount of data at these stages somebody 76:49 mentioned oh what if you want to feed 76:51 the output of reduce into another 76:52 MapReduce job and indeed that was often 76:55 what people wanted to do in 76:56 which case the output of the reduce might 76:58 be enormous like for sort or web 77:00 indexing the output of the reduces on ten 77:03 terabytes of input is again 77:05 gonna be ten terabytes so 77:07 the output of the reduce is also stored 77:09 on GFS the 77:12 reduce would just produce these key 77:13 value pairs but the MapReduce framework 77:18 would gather them up and write them into 77:20 giant files on GFS and so there was 77:23 another round of network communication 77:27 required to get the output of each 77:30 reduce to the GFS server that needed to 77:33 store that reduce's output and you might 77:35 think that they could have played the 77:37 same trick with the output storing 77:39 the output on the GFS server that 77:42 happened to run on the same machine as the MapReduce worker 77:46 that ran the reduce and maybe they did 77:48 do that but because GFS as well as 77:51 splitting data for performance also 77:53 keeps two or three copies for fault 77:55 tolerance that means no matter what you 77:58 need to write one copy of the data 77:59 across the network to a different server 78:01 so there's a lot of network 78:03 communication here and a bunch here also 78:05 and it was this network communication 78:08 that really limited the throughput of 78:09 MapReduce 78:10 in 2004 now in 2020 because this network 78:17 arrangement was such a limiting factor 78:19 for so many things people wanted to do 78:21 in datacenters modern data center 78:23 networks are a lot faster at the root 78:26 than this was so you know a 78:28 typical data center network you might 78:30 see today actually has many roots instead 78:32 of a single root switch that everything 78:34 has to go through you might have you 78:37 know many root switches and each rack 78:40 switch has a connection to each of these 78:42 sort of replicated root switches and the 78:44 traffic is split up among the root 78:46 switches so modern data center networks 78:48 have far more network throughput and 78:52 because of that actually I think 78:54 Google sort of stopped using MapReduce a 78:57 few years ago but before they stopped 79:00 using it the modern MapReduce actually 79:02 no longer tried to run the maps on the 79:04 same machine the data is stored on they 79:06 were happy to move the data from 79:08 anywhere because they just assumed the network 79:11 was extremely fast okay we're out of 79:16 time for MapReduce 79:18 we have a lab due at the end of next 79:21 week 79:22 in which you'll write your own somewhat 79:24 simplified MapReduce so have fun with 79:27 that 79:28 and see you on Thursday