All right, today we're going to talk about Spark. Spark is essentially a successor to MapReduce; you can think of it as an evolutionary step beyond MapReduce. One reason we're looking at it is that it's widely used today for data center computations: it's turned out to be very popular and very useful. One interesting thing it does, which we'll pay attention to, is that it generalizes the two stages of MapReduce, the map and the reduce, into a full notion of multi-step data flow graphs. That's helpful for flexibility for the programmer, since it's more expressive, and it also gives the Spark system a lot more to chew on when it comes to optimization and dealing with failures. From the programmer's point of view it also supports iterative applications, applications that loop over the data, much better than MapReduce does. You can cobble together a lot of this with multiple MapReduce applications running one after another, but it's all a lot more convenient in Spark.

So I think I'm just going to start right off with an example application. This is the code for PageRank, which I copied, with a few changes, from some sample source code in the Spark source. It's a little hard to read on the screen; if it's too hard to read, there's a copy of it in the notes, and it's an expansion of the code in section 3.2.2 of the paper. PageRank is a pretty famous algorithm that Google uses for calculating how important different web search results are. Actually, PageRank is widely used as an example of something that doesn't work that well in MapReduce, and the reason is that PageRank involves a bunch of distinct steps, and worse, PageRank involves iteration: there's a loop in it that has to be run many times, and MapReduce just has nothing to say about iteration.

The input to this version of PageRank is just a giant collection of lines, one per link in the web. Each line has two URLs: the URL of the page containing a link, and the URL that that page points to. The intent is that you'd get this file by crawling the web and collecting together all the links in the web, so the input is absolutely enormous. As a silly little example for when I actually run this code, I've given some sample input here, and this is the way the input would really look: just lines, each with two URLs. I'm using u1 as the URL of a page and u3, for example, as the URL of a link that that page points to, just for convenience. The web graph that this input file represents has only three pages in it: one, two, three. Interpreting the links: there's a link from one to three, a link from one back to itself, a link from two to three,
a link from two back to itself, and a link from three to one. Just a very simple graph structure.

What PageRank is trying to do is estimate the importance of each page, and what that really means is that it estimates importance based on whether other important pages have links to a given page. What's really going on is that it's modeling the estimated probability that a user who clicks on links will end up on each given page. It has a user model in which the user has an 85 percent chance of following a randomly selected link from the current page to wherever that link leads, and a 15 percent chance of simply switching to some other page even though there's no link to it, as you would if you entered a URL directly into the browser. The PageRank algorithm runs this repeatedly: it simulates the user looking at a page and then following a link, adds the from-page's importance to the target page's importance, and then runs it again. In a system like Spark it's going to run this simulation for all pages in parallel, iteratively. The algorithm keeps track of the rank of every single page, every single URL, and updates it as it simulates random user clicks, and eventually those ranks converge on the true final values.

Because it's iterative, although you can code this up in raw MapReduce, it's a pain. It can't be a single MapReduce program; it has to be multiple calls to a MapReduce application, where each call simulates one step of the iteration. So you can do it in MapReduce, but it's a pain, and it's also kind of slow, because MapReduce only thinks about one map and one reduce, and it's always reading its input from disk, from the GFS file system, and always writing its output, which here would be the updated per-page ranks, back to files in GFS at every stage. So there's a lot of file I/O if you run this as a sequence of MapReduce applications.

All right, so we have this PageRank code that came with Spark, and I'm actually going to run the whole thing for you, the code shown here on the input I've shown, just to see what the final output is, and then we'll step through how it executes. You should see a screen share now with a terminal window, and I'm showing you the input file that I'm going to hand to this PageRank program. I've downloaded a copy of Spark to my laptop, which turns out to be pretty easy; it's a precompiled version and it just runs in the Java Virtual Machine, so I can run it very easily. Downloading Spark and running simple stuff turns out to be pretty straightforward.
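For reference, the code being run here is essentially the SparkPageRank example that ships with the Spark source; the sketch below follows that sample, with an illustrative file name and iteration count. Run on the five-line example input described above (u1 u3, u1 u1, u2 u3, u2 u2, u3 u1), it prints a rank for each of the three pages.

    import org.apache.spark.{SparkConf, SparkContext}

    object PageRank {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PageRank"))
        val iters = 10                       // illustrative iteration count

        // Input: one line per link, "<from-URL> <to-URL>", e.g. "u1 u3".
        val lines = sc.textFile("in.txt")
        val links = lines.map { s =>
          val parts = s.split("\\s+")
          (parts(0), parts(1))
        }.distinct().groupByKey().cache()

        var ranks = links.mapValues(v => 1.0)
        for (i <- 1 to iters) {
          val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map(url => (url, rank / size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }

        ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }
        sc.stop()
      }
    }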
So I'm going to run the code that I showed with the input that I showed. We're going to see a lot of junk error messages go by, but in the end Spark runs the program and prints the final result, and we get these three ranks for the three pages. Apparently page one has the highest rank; I'm not completely sure why, but that's what the algorithm ends up doing. Of course we're not really that interested in the algorithm itself so much as in how Spark executes it.

All right, so in order to understand what the programming model in Spark is, because it's perhaps not quite what it looks like, I'm going to hand the program line by line to the Spark interpreter. You can fire up this spark-shell thing and type code to it directly, and I've prepared a version of the PageRank program that I can run a line at a time. The first line is the one that asks Spark to read the input file, the input file I showed with the three pages in it.

One thing to notice here is that when Spark reads a file, what it's actually doing is reading from a GFS-like distributed file system; it happens to be HDFS, the Hadoop file system, but HDFS is very much like GFS. If you have a huge file, as you would with a file containing all the links in the web, HDFS is going to split that file up, chunk by chunk, and shard it over lots and lots of servers. So what reading the file really means is that Spark is going to arrange to run a computation on each of many, many machines, each of which reads one chunk, one partition, of the input file. In fact HDFS typically ends up splitting big files into many more partitions than there are worker machines, so every worker machine ends up being responsible for multiple partitions of the input file. This is all a lot like the way map works in MapReduce.

So that's the first line in the program, and you may wonder what the variable lines actually holds. I printed what lines points to, and it turns out that even though it looks like we've typed a line of code asking the system to read a file, in fact it hasn't read the file, and won't read the file for a while. What this code is really building is a lineage graph: a recipe for the computation we want, the kind of lineage graph you see in Figure 3 in the paper. This code is just building the lineage graph, building the computation recipe, not doing the computation. The computation only actually starts once we execute what the paper calls an action, a function like collect, for example, to finally tell Spark: look, I actually want the output now, please go and actually execute the lineage graph and tell me what the result is.
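As a sketch of what that first step looks like in spark-shell (where sc is the pre-made SparkContext, and the path here is just illustrative):

    // Nothing is read from HDFS yet; this only adds a node to the lineage
    // graph and returns an RDD[String] handle for it.
    val lines = sc.textFile("in.txt")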
So what lines holds is actually a piece of the lineage graph, not a result. Now, in order to understand what the computation will do when we finally run it, we can ask Spark at this point to go ahead and execute the lineage graph up to this point and tell us what the results are. You do that by calling an action; I'm going to call collect, which just prints out all the results of executing the lineage graph so far. All we've asked it to do so far is read a file, so we're expecting the output to be just the contents of the file, and indeed that's what we get: what this one-transformation lineage graph results in is just the lines, one at a time — a set of strings, each of which contains one line of the input.

A student asks: is collect essentially just just-in-time compilation of the symbolic execution chain? Yeah, that's what's going on. A huge amount of stuff happens when you call collect. It tells Spark to take the lineage graph and produce Java bytecodes that describe all the various transformations, which in this case isn't very much since we're just reading a file. When you call collect, Spark figures out where the data you want is by looking in HDFS, picks a set of workers to process the different partitions of the input data, compiles each transformation in the lineage graph into Java bytecodes, and sends those bytecodes out to all the worker machines that Spark chose. Those worker machines execute the bytecodes, which tell each worker to read its partition of the input, and then finally collect goes out and fetches all the resulting data back from the workers. Again, none of this happens until you actually run an action; we've sort of prematurely run collect here. You wouldn't ordinarily do that; I just want to see what the output is, to understand what the transformations are doing.
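In spark-shell that premature action might look something like this (reasonable only because the test input is tiny):

    // collect() forces Spark to compile the lineage graph so far, ship it to
    // workers, run it, and pull all the results back to the driver.
    lines.collect().foreach(println)   // prints the raw input lines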
If you look at the code I'm showing, the second line is this map call. lines refers to the output of the first transformation, the set of strings corresponding to lines of the input, and we've asked the system to call map on that. What map does is run a function over each element of the input, in this case each line, and that little function is the `s => ...` arrow expression, which calls split on each line. split takes a string and returns an array of strings, broken at the places where there are spaces, and the final part of the line, the part that refers to parts(0) and parts(1), says that for each line of input we want the output of this transformation to be the first string on the line paired with the second string on the line. So we're just doing a little transformation to turn these strings into something that's a bit easier to process. Again, out of curiosity, I'm going to call collect on links1 just to verify that we understand what it does, and you can see that whereas lines held plain strings, links1 now holds pairs of strings, a from-URL and a to-URL, one pair per link. When this map executes, it can execute totally independently on each worker, on its own partition of the input, because it's considering each line independently: there's no interaction between different lines or different partitions. This map is a purely local operation on each input record, so it can run totally in parallel on all the workers, on all their partitions.

The next line in the program is this call to distinct, and what's going on here is that we only want to count each link once: if a given page has multiple links to another page, we want to consider only one of them for the purposes of PageRank, so distinct just looks for duplicates. Now, if you think about what it actually takes to look for duplicates in a multi-terabyte collection of data items, it's no joke, because the data items are in some random order in the input. Since distinct needs to replace each set of duplicated inputs with a single one, it needs to somehow bring together all of the items that are identical, and that's going to require communication. Remember that all this data is spread out over all the workers, so we have to shuffle the data around so that any two items that are identical end up on the same worker, so that that worker can say: wait a minute, there are three of these, I'm going to replace them with a single one. That means that distinct, when it finally comes to execute, requires communication — a shuffle. The shuffle is going to be driven either by hashing the items to pick the worker that will process each item and then sending it across the network, or possibly it could be implemented with a sort, where the system sorts all the input and then splits the sorted input over the workers. I actually don't know which it does, but either way it could require a lot of work. In this case, however, almost nothing happens, because there were no duplicates: if we run collect on links2, which is the output of distinct, it's basically identical to links1, the input to that transformation, except for order; the order has changed because of course it had to hash or sort or something.
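A sketch of those two transformations, using the one-name-per-step style of the lecture's interactive version (the names links1, links2, and so on follow the narration and are otherwise illustrative):

    val links1 = lines.map { s =>
      val parts = s.split("\\s+")   // break each line at whitespace
      (parts(0), parts(1))          // (from URL, to URL)
    }
    val links2 = links1.distinct()  // drop duplicate links; needs a shuffle
    links2.collect().foreach(println)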
The next transformation is groupByKey. What we're heading towards is that, for the computation inside the loop, we want to collect together all the links from a given page into one place. So groupByKey takes all these from/to URL pairs and groups them by the from-URL; that is, it brings together all the links that start at the same page and collapses them down, so that for each page we get that page's URL plus a list of the links that start at that page. Again, this would seem to require communication, although I suspect Spark is clever enough to optimize it: because the distinct already put all records with the same from-URL on the same worker, the groupByKey may well not have to communicate at all, since it can observe that the data is already grouped by the from-URL key. All right, let's run collect on links3 to actually drive the computation and see what the result is. Indeed, what we're looking at is an array of tuples, where the first part of each tuple is the URL of the from-page and the second is the list of links that start at that page: u2 has links to u2 and u3, u3 has a link to just u1, and u1 has links to u1 and u3.

OK, so that's links3. Now, the iteration that starts a couple of lines from here is going to use this information in links3 over and over again; each iteration of the loop uses it to propagate probabilities, to simulate users clicking from every page to the pages it links to. So this links data is going to be used over and over, and we're going to want to save it. It turns out that each time I've called collect so far, Spark has re-executed the computation from scratch: every call to collect I've made has involved Spark re-reading the input file, re-running that first map, and re-running the distinct, and if I called collect again it would re-run this groupByKey. We don't want to do that over and over on multiple terabytes of links for each loop iteration, because we've computed it once and this list of links is going to stay the same; we just want to save it and reuse it. In order to tell Spark that we want to reuse this over and over, the programmer is required to explicitly do what the paper calls persisting the data; in modern Spark, the function you call if you want to keep it in memory is actually called cache. So links4 is just identical to links3, except with the annotation that we'd like Spark to keep links4 in memory because we're going to use it over and over again.

The last thing we need to do before the loop starts is set up a rank for every page, indexed by source URL, and initialize every page's rank. They're not really ranks here so much as probabilities, and we initialize them all to one, so every page starts out with the same rank.
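A sketch of these steps; initializing the ranks with mapValues over the cached links is how the Spark sample does it, and I'm assuming the lecture's line-at-a-time version does something equivalent:

    val links3 = links2.groupByKey()         // (URL, all the URLs it links to)
    val links4 = links3.cache()              // ask Spark to keep this RDD in memory
    var ranks  = links4.mapValues(v => 1.0)  // every page starts with rank 1.0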
Now, we're going to execute code that looks like it's changing ranks, but in fact, when we execute the loop in the code I'm showing, it really produces a new version of ranks for every loop iteration, updated to reflect the fact that the algorithm has pushed rank from each page to the pages it links to. Let's print ranks to see what's inside: it's just a mapping from source URL to the current rank value for every page.

OK, before we start executing the loop, a student asks: does Spark allow the user to request more fine-grained primitives than cache, that is, to control where the data is stored or how the computations are performed? Well, yes: cache is a special case of a more general persist call, which can tell Spark, look, I want to save this data in memory, or I want to save it in HDFS so that it's replicated and will survive crashes, so you get a little flexibility there. More generally, we didn't have to say anything about partitioning in this code, and Spark will just choose something. At first the partitioning is driven by the partitioning of the original input files, but when we run transformations that have to shuffle, that change the partitioning, like distinct and groupByKey, Spark does something internally: if we don't say anything, it'll just pick some scheme, like hashing the keys over the available workers. But you can tell it: look, it turns out this particular way of partitioning the data works better — use a different hash function, or partition by ranges instead of hashing. You can control the partitioning in more clever ways if you like.
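A sketch of those two knobs — storage levels beyond cache, and explicit partitioning. The storage level, partition count, and variable names here are just examples:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); other levels
    // spill to disk or keep two replicas, trading memory for durability.
    val linksSafer = links1.distinct().persist(StorageLevel.MEMORY_AND_DISK_2)

    // Partitioning can also be controlled explicitly, e.g. hash into 8 partitions.
    val linksByHash = links2.partitionBy(new HashPartitioner(8))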
OK, so I'm about to start the loop. The first thing the loop does — I hope you can see the code on line 12 — is run this join; it's the first statement of the first iteration of the loop. What the join is doing is joining the links with the ranks, pulling together the corresponding entries: links says, for every URL, what it has links to, and ranks says, for every URL, what its current rank is. So now we have, in a single item for every page, both its current rank and the links it points to, because we're going to push every page's current rank to all the pages it points to. This join is what the paper calls a wide transformation, because it's not local: it may need to shuffle the data by the URL key in order to bring corresponding elements of links and ranks together. In fact, I believe Spark is clever enough to notice that links and ranks are already partitioned by key in the same way — that assumes that when it created ranks it cleverly used the same hash scheme it used when it created links. If it was that clever, then it will notice that links and ranks are partitioned the same way, that is, that the corresponding partitions with the same keys are already on the same workers, and hopefully Spark notices that and doesn't have to move any data around. If it turns out that links and ranks are partitioned in different ways, then data will have to move at this point to join up corresponding keys in the two RDDs.

So jj now contains both every page's rank and every page's list of links. As you can see, we have an even more complex data structure: an array with an element per page, containing the page's URL, the list of its links, and the floating-point number that is the page's current rank. All this information for each page is in a single record, together where we need it.

The next step is that every page pushes a fraction of its current rank to all the pages it links to; it divides its current rank up among the pages it links to. That's what this contribs computation does. It's another map — a flatMap in the code — and for each page we're mapping over the URLs that the page points to, and for each one we calculate this number, which is the from-page's current rank divided by the total number of pages it points to. So this creates a mapping from link target to one of the many contributions to that page's new rank. We can sneak a peek at what this produces, and it's a much simpler thing: just a list of URLs and contributions to those URLs' ranks. There's more than one record per URL here, because for any given page there's going to be a record for every single link that points to it, indicating the contribution from wherever that link came from to this page's new, updated rank.

What has to happen now is that we need to sum up, for every page, the rank contributions for that page that are sitting in contribs. So again we're going to need a shuffle here; it's a transformation with wide input, because we need to bring together all of the elements of contribs for each page onto the same worker, into the same partition, so they can be summed up. The way PageRank does that is with this reduceByKey call. What reduceByKey does is, first, bring together all the records with the same key, and then sum up the second element of each of those records for a given key, producing as output the key, which is a URL, and the sum of the numbers, which is the updated rank. There are actually two transformations here: the first is the reduceByKey, and the second is this mapValues.
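Here's a sketch of one loop iteration, following the Spark sample that the lecture's code is based on; jj and contribs match the names used above, and the use of .values before the flatMap is how the sample does it:

    val jj = links4.join(ranks)   // (URL, (its outgoing links, its current rank))
    val contribs = jj.values.flatMap { case (targets, rank) =>
      val size = targets.size
      targets.map(url => (url, rank / size))   // each target gets an equal share
    }
    // Sum each page's incoming contributions, then apply the user model.
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)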
The mapValues is the part that implements the 15% probability of going to a random page and the 85% chance of following a link. All right, let's look at ranks. By the way, even though we've assigned to ranks here, what this ends up doing is creating an entirely new transformation; it's not changing values that were already computed — or rather, when it comes to executing this, it won't change any values already computed, it just creates a new transformation with new output. And we can see what's going to happen: remember, ranks originally was just a bunch of URL/rank pairs, all ones; now we again have pairs of URL and rank, but we've updated them, changed them by one step. I don't know if you remember the final rank values we saw at the beginning, but these are closer to that final output than the original values of all one were.

OK, so that was one iteration of the algorithm. When the loop goes back up to the top, it does the same join, flatMap, and reduceByKey, and each time, what the loop is actually doing is producing this lineage graph. It's not updating the variables mentioned in the loop; it's really appending new transformation nodes to the lineage graph it's building. I've only run the loop once here. After the loop — and this is what the real code does — the real code actually runs collect, and in the real PageRank implementation it's only at this point that the computation even starts, because of the call to collect: it goes off and reads the input, runs it through all these transformations, and through the shuffles for the wide dependencies, and finally collects the output together on the computer that's running this program. By the way, the computer that runs this program is what the paper calls the driver; the driver is the machine that runs this Scala program that's driving the Spark computation. Then the program takes the output variable and runs each of the collected records through a nicely formatted print.

So that's the kind of style of programming that people use for Spark. One thing to note, relative to MapReduce, is that while this program looks a little bit complex, it's doing an amount of work that would require many separate MapReduce programs to implement. It's 21 lines; maybe you're used to MapReduce programs that are simpler than that, but this is doing a lot of work for 21 lines, and it's a real algorithm, too. So it's a pretty concise and easy-to-program way to express vast big-data computations, which is why people like it; it's been pretty successful. Again, I just want to repeat that until the final collect, all this code is doing is generating a lineage graph, not processing the data.
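Two things you can do at this point in spark-shell: peek at the lineage Spark has accumulated, and finally run it (toDebugString is a standard RDD method; the print format below is just one choice):

    // A textual rendering of the lineage graph built up so far.
    println(ranks.toDebugString)

    // Only this action actually drives the computation; the results come back
    // to the driver, which formats and prints them.
    ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }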
The lineage graph that it produces — I've just copied the figure from the paper — looks like this. This graph is all that the program is producing until the final collect. You can see that it's a sequence of processing stages: we read the file to produce links, and then, completely separately, we produce the initial ranks, and then there are repeated join and reduceByKey pairs; each of those pairs is one loop iteration. You can see that the loop appends more and more nodes to the graph. In particular, it is not producing a cyclic graph; all of these graphs are acyclic. Another thing to notice, which you wouldn't have seen in MapReduce, is that this data here — the data that we cached, that we persisted — is used over and over again in every loop iteration, so Spark is going to keep it in memory and consult it multiple times.

All right, so what actually happens during execution? What does the execution look like? Again, the assumption is that the input data starts out pre-partitioned over HDFS: our one input file is already split up into lots of 64-megabyte, or whatever they happen to be, pieces in HDFS. When you actually call collect and start the computation, Spark knows that the input data is already partitioned in HDFS, and it tries to split up the work over the workers in a corresponding way. I don't actually know the details: it might try to run the computation on the same machines that store the HDFS data, or it may just set up a bunch of workers to read each of the HDFS partitions, and again there are likely to be more partitions than workers. So we have the input file, and the very first thing is that each worker reads its part of the input file. If you remember, the next step is a map, where each worker runs a little function that splits each line of input into a from/to link tuple. This is a purely local operation, so it can go on in the same worker: we read the data, and then in the very same worker Spark does that initial map. I'm drawing an arrow here from each worker to itself, so there's no network communication involved; the output of the read can be fed directly to that little map function. In fact, Spark streams the data record by record through these transformations: instead of reading the entire input partition and then running the map on the whole partition, Spark reads the first record, or maybe the first couple of records, and runs each record through as many transformations as it can before going on to read the next little bit from the file.
That's so that it doesn't have to store everything: these files could be very large, and it's much more efficient to process them record by record than to hold an entire input partition in memory.

OK, a student asks: is the first node in each chain the worker holding the HDFS chunks, and the remaining nodes in the chain the nodes in the lineage? Yeah, I'm afraid I've been a little bit confusing here. The way to think of this is that, so far, all of this is happening on individual workers: this is worker one, maybe this is another worker, and each worker is proceeding independently. I'm imagining that they're all running on the same machines that store the different partitions of the HDFS file, though there could be network communication here to get from HDFS to the responsible worker; but after that it's all fast, local operations.

So this is what happens with what the paper calls narrow dependencies, that is, transformations that consider each record of data independently, without ever having to worry about its relationship to other records. By the way, this is already potentially more efficient than MapReduce, because if we have what amount to multiple map phases here, they're just strung together in memory, whereas in MapReduce, if you're not super clever and you run multiple MapReduces, even degenerate map-only MapReduce applications, each stage would read its input from GFS, compute, and write its output back to GFS, and then the next stage would read, compute, and write again. Here we've eliminated that reading and writing; it's not a very deep advantage, but it sure helps enormously for efficiency.

However, not all the transformations are narrow; not all of them just read their input record by record with every record independent of the others. The ones to worry about are the distinct call, which needs to see all records that have a particular key; similarly, groupByKey needs to see all records with a given key; and join also has to move things around, since it takes two inputs and needs to bring together all records from both inputs that have the same key. So there are a bunch of these non-local transformations, which the paper calls wide transformations because they potentially have to look at all partitions of the input. That's a lot like reduce in MapReduce.
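To keep the two categories straight for this particular program, here's a rough labeling (narrow steps pipeline locally; wide ones force a shuffle, subject to the co-partitioning caveats discussed below):

    val split   = lines.map { s => val p = s.split("\\s+"); (p(0), p(1)) }  // narrow
    val deduped = split.distinct()                                          // wide
    val grouped = deduped.groupByKey()                                      // wide (maybe avoidable)
    val joined  = grouped.join(ranks)                                       // wide (maybe avoidable)
    val shares  = joined.values.flatMap { case (t, r) => t.map(u => (u, r / t.size)) } // narrow
    val summed  = shares.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)      // wide, then narrow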
Take distinct as an example. Distinct is also going to run on multiple workers, and distinct works on each key independently, so we can partition the computation by key; but the data currently isn't partitioned by key at all — it isn't really partitioned by anything in particular, just however HDFS happened to store it. So for distinct, we're going to run it on all the workers, partitioned by key, but any one worker needs to see all of the input records with a given key, and those may be spread out over all of the workers of the preceding transformation: each worker is responsible for different keys, but the keys may be spread out over all the workers of the preceding transformation. In fact the workers are typically the same — it's going to be the same workers running the map that run the distinct — but the data needs to be moved between the two transformations to bring all the keys together. So what Spark is actually going to do is take the output of this map, hash each record by its key, and use that, mod the number of workers, to select which worker should see it. The implementation is a lot like your implementation of MapReduce: the very last thing that happens in the last of the narrow stages is that the output gets chopped up into buckets corresponding to the different workers of the next transformation, and left waiting for them to fetch it. So the scheme is that each of the workers runs as many of the narrow stages as it can to completion and stores the output split up into buckets; when all of those are finished, we can start running the workers for the distinct transformation, whose first step is to go and fetch, from every other worker, the relevant bucket of the output of the last narrow stage. Then they can run the distinct, because all the records with a given key are on the same worker, and they can start producing output themselves.

Now, of course, these wide transformations are quite expensive. The narrow transformations are super efficient, because we're just taking each record and running a bunch of functions on it totally locally; the wide transformations require pushing a lot of data — in fact essentially all of the data. For PageRank, if you have terabytes of input data, it's still basically the same amount of data at this stage, because it's all the links in the web, so now we're pushing terabytes and terabytes of data over the network to implement this shuffle from the output of the map functions to the input of the distinct. So these wide transformations are pretty heavyweight — a lot of communication — and they're also a kind of computation barrier, because we have to wait for all the narrow processing to finish before we can go on to the wide transformation.

That said, there are some optimizations that are possible because Spark creates the entire lineage graph before it starts any of the data processing, so Spark can inspect the lineage graph and look for opportunities to optimize. Running a sequence of narrow stages all on the same machine, as basically sequential function calls on each input record, is definitely an optimization you can only make if you see the entire lineage graph at once. Another optimization Spark does is noticing when the data has already been partitioned, by an earlier wide shuffle, in the way that the next wide transformation is going to need it. In our original program we have two wide transformations in a row: distinct requires a shuffle, but groupByKey also brings together all the records with a given key, replacing them with, for every key, the list of links starting at that URL. These are both wide operators, both grouping by key, so maybe we have to do a shuffle for the distinct, but Spark can cleverly recognize that the data is then already shuffled in a way that's appropriate for groupByKey, and we don't have to do another shuffle. So even though groupByKey could in principle be a wide transformation, in this case I suspect Spark implements it without communication, because the data is already partitioned by key.
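One way to poke at this from spark-shell is to look at an RDD's partitioner field, which is what Spark consults when deciding whether another shuffle is needed; what it actually prints for each of these RDDs is worth checking rather than trusting my guess:

    // An RDD that carries a partitioner can feed a join or groupByKey on the
    // same keys without being shuffled again.
    println(links2.partitioner)   // output of distinct
    println(links3.partitioner)   // output of groupByKey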
So maybe the groupByKey can, in this particular case, be done without shuffling data, without that expense. Of course, Spark can only do this because it produces the entire lineage graph first and only then runs the computation, so it gets a chance to examine, optimize, and maybe transform the graph.

So that's that topic. Any questions about lineage graphs or how things are executed? Feel free to interrupt.

The next thing I want to talk about is fault tolerance. For these kinds of computations, the fault tolerance we're looking for is not the sort of absolute fault tolerance you'd want with a database, where you really cannot ever afford to lose anything. Here the fault tolerance we're looking for is more like: well, it's expensive if we have to repeat the computation; we could totally repeat it if we had to, but it would take a couple of hours, and that's irritating but not the end of the world. So we're looking to tolerate common failures, but we certainly don't need bulletproof ability to tolerate every possible error. For example, Spark doesn't replicate the driver machine; if the driver, which is controlling the computation and knows about the lineage graph, crashes, I think you have to rerun the whole thing. But any one machine only crashes maybe every few months, so that's no big deal. Another thing to notice is that HDFS is sort of a separate thing: Spark just assumes the input is replicated in a fault-tolerant way on HDFS, and indeed, just like GFS, HDFS keeps multiple copies of the data on multiple servers, so if one of them crashes it can soldier on with the other copy. So the input data is assumed to be relatively fault tolerant, and what that means, at the highest level, is that Spark's strategy, if one of the workers fails, is just to recompute whatever that worker was responsible for: to repeat the computations that were lost with the worker on some other worker, on some other machine. That's basically what's going on.
Now, that recomputation might take a while if you have a long lineage, as you would actually get with PageRank, because PageRank with many iterations produces a very long lineage graph. One way Spark makes it not so bad is that each worker is actually responsible for multiple partitions of the input, so Spark can give each remaining worker just one of the failed worker's partitions, and the remaining workers can basically parallelize the recomputation of what was lost, each handling one of the failed worker's partitions. So if all else fails, Spark just goes back to the beginning, to the input, and recomputes everything that was running on that machine. For narrow dependencies, that's pretty much the end of the story.

However, there actually is a problem with wide dependencies that makes that story not as attractive as you might hope. So the topic here is the failure of one node, one failed worker, in a lineage graph that has wide dependencies. A reasonable sample graph: maybe you have a dependency graph that starts with some narrow dependencies, but then after a while you have a wide dependency — transformations that depend on all the preceding transformations — and then some more narrow ones. The situation is that a single worker has failed, and we need to reconstruct what it held, before we've gotten to the final action and produced the output; we need to recompute what was on the failed worker. The damaging thing here is that, ordinarily, as Spark executes along, it executes each of the transformations and gives its output to the next transformation, but doesn't hold on to that output, unless you happen to tell it to — like the links data, which is persisted with that cache call. In general that data is not held on to, because if you have something like the PageRank lineage graph, maybe dozens or hundreds of steps long, you don't want to hold on to all that data; it's way too much to fit in memory. So as Spark moves through these transformations, it discards the data associated with earlier transformations. That means that when we get here and this worker fails, we need to restart its computation on a different worker. It can re-read the input and redo the original narrow transformations, which just depend on the input we re-read; but when we get to this wide transformation, we have the problem that it requires input not just from the same partition on the same worker, but also from every other partition, on these other workers. Those workers, which are still alive, have in this example proceeded past this transformation and therefore discarded its output, since it may have been a while ago, and so the input that our recomputation needs from all the other partitions doesn't exist anymore.
If we're not careful, that means that in order to rebuild the computation on the failed worker, we may in fact have to re-execute this part on every other worker, as well as the entire lineage graph on the failed worker. That could be very damaging: if I've been running this giant Spark job for a day and then one of a thousand machines fails, it may mean, if we don't know anything cleverer than this, that we have to go back to the very beginning on every one of the workers and recompute the whole thing from scratch. It's the same amount of work; it's going to take another day to recompute a day's computation. That would be unacceptable. We'd really like it so that if one worker out of a thousand crashes, we have to do relatively little work to recover.

Because of that, Spark allows you to make periodic checkpoints of specific transformations. In this graph, what we would do in the Scala program is call — I think it's the persist call, with a special argument — that says: after you compute the output of this transformation, please save the output to HDFS. Then, if something fails, Spark knows that the output of the preceding transformation was saved to HDFS, so we just have to read it from HDFS instead of recomputing it, for all partitions, back to the beginning of time. And because HDFS is a separate storage system that is itself replicated and fault tolerant, the fact that one worker fails doesn't matter: HDFS will still be available. For our PageRank example, I think the traditional thing would be to tell Spark to checkpoint ranks. You can even tell it to checkpoint only periodically: if you're going to run this thing for 100 iterations, it takes a fair amount of time to save the entire ranks to HDFS — again, we're talking about terabytes of data in total — so maybe we tell Spark to checkpoint ranks to HDFS only every 10th iteration or something, to limit the expense. It's a trade-off between the expense of repeatedly saving stuff to disk and how much you'd have to go back and redo if a worker failed.
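The paper describes this as persist with a replication flag; in current Spark the closest mechanism I know of is RDD.checkpoint() plus a checkpoint directory, so a sketch of the every-10th-iteration idea might look like this (the directory path and the interval are illustrative):

    sc.setCheckpointDir("checkpoints")      // would be an HDFS path on a real cluster
    for (i <- 1 to 100) {
      val contribs = links4.join(ranks).values.flatMap { case (t, r) =>
        t.map(u => (u, r / t.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      if (i % 10 == 0) ranks.checkpoint()   // materialized when the next action runs
    }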
There's a question: when we call cache, does that act as a checkpoint? OK, this is a very good question, which I don't know the answer to. The observation is that we could call cache here, and we do call cache here, and the usual use of cache is just to save data in memory with the intent to reuse it; that's certainly why it's being called here, because we're reusing links4. But in my example it would also have the effect of making the output of this stage available in memory — not in HDFS, but in the memory of these workers. The paper never talks about this possibility, and I'm not really sure what's going on. Maybe that would work; or maybe the fact that cache requests are merely advisory, and the data may be evicted if the workers run out of space, means that calling cache isn't a reliable directive to make sure the data really is available — it's more like, well, it'll probably be available on most nodes, but not all, and remember, if even a single node loses its data we're going to have to do a bunch of recomputation. So I'm guessing that persist with replication is a firm directive that guarantees the data will be available even if there's a failure, but I don't really know; it's a good question.

All right, so that's the programming model, the execution model, and the failure strategy. And by the way, just to beat on the failure strategy a little bit more: the way these systems do failure recovery is not a minor thing. As people build bigger and bigger clusters, with thousands and thousands of machines, the probability that a job will be interrupted by at least one worker failure really does start to approach one, and so recent designs intended to run on big clusters have, to a great extent, been dominated by the failure recovery strategy. That's a lot of the explanation, for example, for why Spark insists that the transformations be deterministic and why its RDDs are immutable: that's what allows it to recover from a failure by simply recomputing one partition, instead of having to start the entire computation from scratch. There have been plenty of proposed cluster big-data execution models in the past in which there really was mutable data and in which computations could be non-deterministic — if you look up distributed shared memory systems, those all support mutable data and non-deterministic execution — but because of that, they tend not to have a good failure strategy. Thirty years ago, when a big cluster was four computers, none of this mattered, because the failure probability was very low, and so many different kinds of computation models seemed reasonable; but as clusters have grown to hundreds and thousands of workers, really the only models that have survived are the ones for which you can devise a very efficient failure recovery strategy that does not require backing all the way up to the beginning and restarting. The paper talks about this a little when it criticizes distributed shared memory, and it's a very valid criticism; it's a big design constraint.

OK, so Spark is not perfect for all kinds of processing. It's really geared up for batch processing of giant amounts of data, bulk data processing. If you have terabytes of data and you want to chew away on it for a couple of hours, Spark is great. If you're running a bank and you need to process bank transfers or people's balance queries, then Spark is just not relevant to that kind of processing, nor to typical websites: if I log into Amazon and I want to order some paper towels and put them into my shopping cart,
Spark is not going to help you maintain that shopping cart. Spark may be useful for analyzing your customers' buying habits offline, but not for that sort of online processing. The other situation, a little closer to home, that the Spark of this paper is not so great at is stream processing. Spark definitely assumes that all the input is already available, but in many situations the input people have is really a stream: they're logging all the user clicks on their websites and they want to analyze them to understand user behavior. It's not a fixed amount of data, it's really a stream of input data, and Spark as described in the paper doesn't really have anything to say about processing streams of data. But this turned out to be quite close to home for people who like to use Spark, and now there's a variant of Spark called Spark Streaming that is a little more geared up for processing data as it arrives: it breaks the stream up into smaller batches and runs a batch at a time through Spark. So Spark is good for a lot of batch stuff, but that's certainly not everything.

All right, to wrap up: you should view Spark as a kind of evolution after MapReduce that fixes some expressivity and performance problems that MapReduce has. A lot of what Spark is doing is making the data flow graph explicit: it wants you to think of computations in the style of Figure 3, as entire lineage graphs — stages of computation and the data moving between those stages. It does optimizations on this graph, and failure recovery is very much thinking about the lineage graph as well. So it's really part of a larger move in big data processing toward explicitly thinking about data flow graphs as a way to describe computations. A lot of the specific wins in Spark have to do with performance; some of them are straightforward, but nevertheless important. Some of the performance comes from leaving data in memory between transformations, rather than writing it to GFS and reading it back at the beginning of the next transformation, which you essentially have to do with MapReduce. The other is the ability to define these datasets, these RDDs, and tell Spark to leave an RDD in memory because you're going to reuse it in subsequent stages, and it's cheaper to reuse it than to recompute it; that's easy in Spark and hard to get at in MapReduce. The result is a system that's extremely successful and extremely widely used, and it deserves its success. OK, that's all I have to say, and I'm happy to take questions if anyone has them.