All right, today we're going to talk about Spark. Spark is essentially a successor to MapReduce; you can think of it as an evolutionary step beyond MapReduce. One reason we're looking at it is that it's widely used today for data center computations: it's turned out to be very popular and very useful. One interesting thing it does, which we'll pay attention to, is that it generalizes the two stages of MapReduce, the map and the reduce, into a full notion of multi-step data flow graphs. That's helpful for flexibility for the programmer, since it's more expressive, and it also gives the Spark system a lot more to chew on when it comes to optimization and dealing with failures. From the programmer's point of view it also supports iterative applications, applications that loop over the data, much better than MapReduce does. You can cobble together a lot of this with multiple MapReduce applications running one after another, but it's all a lot more convenient in Spark.

So I think I'm just going to start right off with an example application. This is the code for PageRank, which I copied, with a few changes, from some sample source code in the Spark source. It's a little hard to read on the screen; if it's too hard to read, there's a copy of it in the notes, and it's an expansion of the code in section 3.2.2 of the paper. PageRank is a pretty famous algorithm that Google uses for calculating how important different web search results are. Actually, PageRank is widely used as an example of something that doesn't work that well in MapReduce, and the reason is that PageRank involves a bunch of distinct steps, and worse, PageRank involves iteration: there's a loop in it that has to be run many times, and MapReduce just has nothing to say about iteration.

The input to this version of PageRank is just a giant collection of lines, one per link in the web. Each line has two URLs: the URL of the page containing a link, and the URL that that page points to. The intent is that you'd get this file by crawling the web and collecting together all the links in the web, so the input is absolutely enormous. As a silly little example for when I actually run this code, I've given some sample input here, and this is the way the input would really look: just lines, each with two URLs. I'm using u1 as the URL of a page and u3, for example, as the URL of a link that that page points to, just for convenience. The web graph that this input file represents has only three pages in it: one, two, three. Interpreting the links: there's a link from one to three, a link from one back to itself, a link from two to three,
a link from two back to itself, and a link from three to one. Just a very simple graph structure.

What PageRank is trying to do is estimate the importance of each page, and what that really means is that it estimates importance based on whether other important pages have links to a given page. What's really going on is that it's modeling the estimated probability that a user who clicks on links will end up on each given page. It has a user model in which the user has an 85 percent chance of following a randomly selected link from the current page to wherever that link leads, and a 15 percent chance of simply switching to some other page even though there's no link to it, as you would if you entered a URL directly into the browser. The PageRank algorithm runs this repeatedly: it simulates the user looking at a page and then following a link, adds the from-page's importance to the target page's importance, and then runs it again. In a system like Spark it's going to run this simulation for all pages in parallel, iteratively. The algorithm keeps track of the rank of every single page, every single URL, and updates it as it simulates random user clicks, and eventually those ranks converge on the true final values.

Because it's iterative, although you can code this up in raw MapReduce, it's a pain. It can't be a single MapReduce program; it has to be multiple calls to a MapReduce application, where each call simulates one step of the iteration. So you can do it in MapReduce, but it's a pain, and it's also kind of slow, because MapReduce only thinks about one map and one reduce, and it's always reading its input from disk, from the GFS file system, and always writing its output, which here would be the updated per-page ranks, back to files in GFS at every stage. So there's a lot of file I/O if you run this as a sequence of MapReduce applications.

All right, so we have this PageRank code that came with Spark, and I'm actually going to run the whole thing for you, the code shown here on the input I've shown, just to see what the final output is, and then we'll step through how it executes. You should see a screen share now with a terminal window, and I'm showing you the input file that I'm going to hand to this PageRank program. I've downloaded a copy of Spark to my laptop, which turns out to be pretty easy; it's a precompiled version and it just runs in the Java Virtual Machine, so I can run it very easily. Downloading Spark and running simple stuff turns out to be pretty straightforward.
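For reference, the code being run here is essentially the SparkPageRank example that ships with the Spark source; the sketch below follows that sample, with an illustrative file name and iteration count. Run on the five-line example input described above (u1 u3, u1 u1, u2 u3, u2 u2, u3 u1), it prints a rank for each of the three pages.

    import org.apache.spark.{SparkConf, SparkContext}

    object PageRank {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PageRank"))
        val iters = 10                       // illustrative iteration count

        // Input: one line per link, "<from-URL> <to-URL>", e.g. "u1 u3".
        val lines = sc.textFile("in.txt")
        val links = lines.map { s =>
          val parts = s.split("\\s+")
          (parts(0), parts(1))
        }.distinct().groupByKey().cache()

        var ranks = links.mapValues(v => 1.0)
        for (i <- 1 to iters) {
          val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map(url => (url, rank / size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }

        ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }
        sc.stop()
      }
    }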
So I'm going to run the code that I showed with the input that I showed. We're going to see a lot of junk error messages go by, but in the end Spark runs the program and prints the final result, and we get these three ranks for the three pages. Apparently page one has the highest rank; I'm not completely sure why, but that's what the algorithm ends up doing. Of course we're not really that interested in the algorithm itself so much as in how Spark executes it.

All right, so in order to understand what the programming model in Spark is, because it's perhaps not quite what it looks like, I'm going to hand the program line by line to the Spark interpreter. You can fire up this spark-shell thing and type code to it directly, and I've prepared a version of the PageRank program that I can run a line at a time. The first line is the one that asks Spark to read the input file, the input file I showed with the three pages in it.

One thing to notice here is that when Spark reads a file, what it's actually doing is reading from a GFS-like distributed file system; it happens to be HDFS, the Hadoop file system, but HDFS is very much like GFS. If you have a huge file, as you would with a file containing all the links in the web, HDFS is going to split that file up, chunk by chunk, and shard it over lots and lots of servers. So what reading the file really means is that Spark is going to arrange to run a computation on each of many, many machines, each of which reads one chunk, one partition, of the input file. In fact HDFS typically ends up splitting big files into many more partitions than there are worker machines, so every worker machine ends up being responsible for multiple partitions of the input file. This is all a lot like the way map works in MapReduce.

So that's the first line in the program, and you may wonder what the variable lines actually holds. I printed what lines points to, and it turns out that even though it looks like we've typed a line of code asking the system to read a file, in fact it hasn't read the file, and won't read the file for a while. What this code is really building is a lineage graph: a recipe for the computation we want, the kind of lineage graph you see in Figure 3 in the paper. This code is just building the lineage graph, building the computation recipe, not doing the computation. The computation only actually starts once we execute what the paper calls an action, a function like collect, for example, to finally tell Spark: look, I actually want the output now, please go and actually execute the lineage graph and tell me what the result is.
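As a sketch of what that first step looks like in spark-shell (where sc is the pre-made SparkContext, and the path here is just illustrative):

    // Nothing is read from HDFS yet; this only adds a node to the lineage
    // graph and returns an RDD[String] handle for it.
    val lines = sc.textFile("in.txt")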
So what lines holds is actually a piece of the lineage graph, not a result. Now, in order to understand what the computation will do when we finally run it, we can ask Spark at this point to go ahead and execute the lineage graph up to this point and tell us what the results are. You do that by calling an action; I'm going to call collect, which just prints out all the results of executing the lineage graph so far. All we've asked it to do so far is read a file, so we're expecting the output to be just the contents of the file, and indeed that's what we get: what this one-transformation lineage graph results in is just the lines, one at a time — a set of strings, each of which contains one line of the input.

A student asks: is collect essentially just just-in-time compilation of the symbolic execution chain? Yeah, that's what's going on. A huge amount of stuff happens when you call collect. It tells Spark to take the lineage graph and produce Java bytecodes that describe all the various transformations, which in this case isn't very much since we're just reading a file. When you call collect, Spark figures out where the data you want is by looking in HDFS, picks a set of workers to process the different partitions of the input data, compiles each transformation in the lineage graph into Java bytecodes, and sends those bytecodes out to all the worker machines that Spark chose. Those worker machines execute the bytecodes, which tell each worker to read its partition of the input, and then finally collect goes out and fetches all the resulting data back from the workers. Again, none of this happens until you actually run an action; we've sort of prematurely run collect here. You wouldn't ordinarily do that; I just want to see what the output is, to understand what the transformations are doing.
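In spark-shell that premature action might look something like this (reasonable only because the test input is tiny):

    // collect() forces Spark to compile the lineage graph so far, ship it to
    // workers, run it, and pull all the results back to the driver.
    lines.collect().foreach(println)   // prints the raw input lines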
If you look at the code I'm showing, the second line is this map call. lines refers to the output of the first transformation, the set of strings corresponding to lines of the input, and we've asked the system to call map on that. What map does is run a function over each element of the input, in this case each line, and that little function is the `s => ...` arrow expression, which calls split on each line. split takes a string and returns an array of strings, broken at the places where there are spaces, and the final part of the line, the part that refers to parts(0) and parts(1), says that for each line of input we want the output of this transformation to be the first string on the line paired with the second string on the line. So we're just doing a little transformation to turn these strings into something that's a bit easier to process. Again, out of curiosity, I'm going to call collect on links1 just to verify that we understand what it does, and you can see that whereas lines held plain strings, links1 now holds pairs of strings, a from-URL and a to-URL, one pair per link. When this map executes, it can execute totally independently on each worker, on its own partition of the input, because it's considering each line independently: there's no interaction between different lines or different partitions. This map is a purely local operation on each input record, so it can run totally in parallel on all the workers, on all their partitions.

The next line in the program is this call to distinct, and what's going on here is that we only want to count each link once: if a given page has multiple links to another page, we want to consider only one of them for the purposes of PageRank, so distinct just looks for duplicates. Now, if you think about what it actually takes to look for duplicates in a multi-terabyte collection of data items, it's no joke, because the data items are in some random order in the input. Since distinct needs to replace each set of duplicated inputs with a single one, it needs to somehow bring together all of the items that are identical, and that's going to require communication. Remember that all this data is spread out over all the workers, so we have to shuffle the data around so that any two items that are identical end up on the same worker, so that that worker can say: wait a minute, there are three of these, I'm going to replace them with a single one. That means that distinct, when it finally comes to execute, requires communication — a shuffle. The shuffle is going to be driven either by hashing the items to pick the worker that will process each item and then sending it across the network, or possibly it could be implemented with a sort, where the system sorts all the input and then splits the sorted input over the workers. I actually don't know which it does, but either way it could require a lot of work. In this case, however, almost nothing happens, because there were no duplicates: if we run collect on links2, which is the output of distinct, it's basically identical to links1, the input to that transformation, except for order; the order has changed because of course it had to hash or sort or something.
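A sketch of those two transformations, using the one-name-per-step style of the lecture's interactive version (the names links1, links2, and so on follow the narration and are otherwise illustrative):

    val links1 = lines.map { s =>
      val parts = s.split("\\s+")   // break each line at whitespace
      (parts(0), parts(1))          // (from URL, to URL)
    }
    val links2 = links1.distinct()  // drop duplicate links; needs a shuffle
    links2.collect().foreach(println)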
The next transformation is groupByKey. What we're heading towards is that, for the computation inside the loop, we want to collect together all the links from a given page into one place. So groupByKey takes all these from/to URL pairs and groups them by the from-URL; that is, it brings together all the links that start at the same page and collapses them down, so that for each page we get that page's URL plus a list of the links that start at that page. Again, this would seem to require communication, although I suspect Spark is clever enough to optimize it: because the distinct already put all records with the same from-URL on the same worker, the groupByKey may well not have to communicate at all, since it can observe that the data is already grouped by the from-URL key. All right, let's run collect on links3 to actually drive the computation and see what the result is. Indeed, what we're looking at is an array of tuples, where the first part of each tuple is the URL of the from-page and the second is the list of links that start at that page: u2 has links to u2 and u3, u3 has a link to just u1, and u1 has links to u1 and u3.

OK, so that's links3. Now, the iteration that starts a couple of lines from here is going to use this information in links3 over and over again; each iteration of the loop uses it to propagate probabilities, to simulate users clicking from every page to the pages it links to. So this links data is going to be used over and over, and we're going to want to save it. It turns out that each time I've called collect so far, Spark has re-executed the computation from scratch: every call to collect I've made has involved Spark re-reading the input file, re-running that first map, and re-running the distinct, and if I called collect again it would re-run this groupByKey. We don't want to do that over and over on multiple terabytes of links for each loop iteration, because we've computed it once and this list of links is going to stay the same; we just want to save it and reuse it. In order to tell Spark that we want to reuse this over and over, the programmer is required to explicitly do what the paper calls persisting the data; in modern Spark, the function you call if you want to keep it in memory is actually called cache. So links4 is just identical to links3, except with the annotation that we'd like Spark to keep links4 in memory because we're going to use it over and over again.

The last thing we need to do before the loop starts is set up a rank for every page, indexed by source URL, and initialize every page's rank. They're not really ranks here so much as probabilities, and we initialize them all to one, so every page starts out with the same rank.
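A sketch of these steps; initializing the ranks with mapValues over the cached links is how the Spark sample does it, and I'm assuming the lecture's line-at-a-time version does something equivalent:

    val links3 = links2.groupByKey()         // (URL, all the URLs it links to)
    val links4 = links3.cache()              // ask Spark to keep this RDD in memory
    var ranks  = links4.mapValues(v => 1.0)  // every page starts with rank 1.0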
Now, we're going to execute code that looks like it's changing ranks, but in fact, when we execute the loop in the code I'm showing, it really produces a new version of ranks for every loop iteration, updated to reflect the fact that the algorithm has pushed rank from each page to the pages it links to. Let's print ranks to see what's inside: it's just a mapping from source URL to the current rank value for every page.

OK, before we start executing the loop, a student asks: does Spark allow the user to request more fine-grained primitives than cache, that is, to control where the data is stored or how the computations are performed? Well, yes: cache is a special case of a more general persist call, which can tell Spark, look, I want to save this data in memory, or I want to save it in HDFS so that it's replicated and will survive crashes, so you get a little flexibility there. More generally, we didn't have to say anything about partitioning in this code, and Spark will just choose something. At first the partitioning is driven by the partitioning of the original input files, but when we run transformations that have to shuffle, that change the partitioning, like distinct and groupByKey, Spark does something internally: if we don't say anything, it'll just pick some scheme, like hashing the keys over the available workers. But you can tell it: look, it turns out this particular way of partitioning the data works better — use a different hash function, or partition by ranges instead of hashing. You can control the partitioning in more clever ways if you like.
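A sketch of those two knobs — storage levels beyond cache, and explicit partitioning. The storage level, partition count, and variable names here are just examples:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); other levels
    // spill to disk or keep two replicas, trading memory for durability.
    val linksSafer = links1.distinct().persist(StorageLevel.MEMORY_AND_DISK_2)

    // Partitioning can also be controlled explicitly, e.g. hash into 8 partitions.
    val linksByHash = links2.partitionBy(new HashPartitioner(8))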
OK, so I'm about to start the loop. The first thing the loop does — I hope you can see the code on line 12 — is run this join; it's the first statement of the first iteration of the loop. What the join is doing is joining the links with the ranks, pulling together the corresponding entries: links says, for every URL, what it has links to, and ranks says, for every URL, what its current rank is. So now we have, in a single item for every page, both its current rank and the links it points to, because we're going to push every page's current rank to all the pages it points to. This join is what the paper calls a wide transformation, because it's not local: it may need to shuffle the data by the URL key in order to bring corresponding elements of links and ranks together. In fact, I believe Spark is clever enough to notice that links and ranks are already partitioned by key in the same way — that assumes that when it created ranks it cleverly used the same hash scheme it used when it created links. If it was that clever, then it will notice that links and ranks are partitioned the same way, that is, that the corresponding partitions with the same keys are already on the same workers, and hopefully Spark notices that and doesn't have to move any data around. If it turns out that links and ranks are partitioned in different ways, then data will have to move at this point to join up corresponding keys in the two RDDs.

So jj now contains both every page's rank and every page's list of links. As you can see, we have an even more complex data structure: an array with an element per page, containing the page's URL, the list of its links, and the floating-point number that is the page's current rank. All this information for each page is in a single record, together where we need it.

The next step is that every page pushes a fraction of its current rank to all the pages it links to; it divides its current rank up among the pages it links to. That's what this contribs computation does. It's another map — a flatMap in the code — and for each page we're mapping over the URLs that the page points to, and for each one we calculate this number, which is the from-page's current rank divided by the total number of pages it points to. So this creates a mapping from link target to one of the many contributions to that page's new rank. We can sneak a peek at what this produces, and it's a much simpler thing: just a list of URLs and contributions to those URLs' ranks. There's more than one record per URL here, because for any given page there's going to be a record for every single link that points to it, indicating the contribution from wherever that link came from to this page's new, updated rank.

What has to happen now is that we need to sum up, for every page, the rank contributions for that page that are sitting in contribs. So again we're going to need a shuffle here; it's a transformation with wide input, because we need to bring together all of the elements of contribs for each page onto the same worker, into the same partition, so they can be summed up. The way PageRank does that is with this reduceByKey call. What reduceByKey does is, first, bring together all the records with the same key, and then sum up the second element of each of those records for a given key, producing as output the key, which is a URL, and the sum of the numbers, which is the updated rank. There are actually two transformations here: the first is the reduceByKey, and the second is this mapValues.
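Here's a sketch of one loop iteration, following the Spark sample that the lecture's code is based on; jj and contribs match the names used above, and the use of .values before the flatMap is how the sample does it:

    val jj = links4.join(ranks)   // (URL, (its outgoing links, its current rank))
    val contribs = jj.values.flatMap { case (targets, rank) =>
      val size = targets.size
      targets.map(url => (url, rank / size))   // each target gets an equal share
    }
    // Sum each page's incoming contributions, then apply the user model.
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)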
The mapValues is the part that implements the 15% probability of going to a random page and the 85% chance of following a link. All right, let's look at ranks. By the way, even though we've assigned to ranks here, what this ends up doing is creating an entirely new transformation; it's not changing values that were already computed — or rather, when it comes to executing this, it won't change any values already computed, it just creates a new transformation with new output. And we can see what's going to happen: remember, ranks originally was just a bunch of URL/rank pairs, all ones; now we again have pairs of URL and rank, but we've updated them, changed them by one step. I don't know if you remember the final rank values we saw at the beginning, but these are closer to that final output than the original values of all one were.

OK, so that was one iteration of the algorithm. When the loop goes back up to the top, it does the same join, flatMap, and reduceByKey, and each time, what the loop is actually doing is producing this lineage graph. It's not updating the variables mentioned in the loop; it's really appending new transformation nodes to the lineage graph it's building. I've only run the loop once here. After the loop — and this is what the real code does — the real code actually runs collect, and in the real PageRank implementation it's only at this point that the computation even starts, because of the call to collect: it goes off and reads the input, runs it through all these transformations, and through the shuffles for the wide dependencies, and finally collects the output together on the computer that's running this program. By the way, the computer that runs this program is what the paper calls the driver; the driver is the machine that runs this Scala program that's driving the Spark computation. Then the program takes the output variable and runs each of the collected records through a nicely formatted print.

So that's the kind of style of programming that people use for Spark. One thing to note, relative to MapReduce, is that while this program looks a little bit complex, it's doing an amount of work that would require many separate MapReduce programs to implement. It's 21 lines; maybe you're used to MapReduce programs that are simpler than that, but this is doing a lot of work for 21 lines, and it's a real algorithm, too. So it's a pretty concise and easy-to-program way to express vast big-data computations, which is why people like it; it's been pretty successful. Again, I just want to repeat that until the final collect, all this code is doing is generating a lineage graph, not processing the data.
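Two things you can do at this point in spark-shell: peek at the lineage Spark has accumulated, and finally run it (toDebugString is a standard RDD method; the print format below is just one choice):

    // A textual rendering of the lineage graph built up so far.
    println(ranks.toDebugString)

    // Only this action actually drives the computation; the results come back
    // to the driver, which formats and prints them.
    ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }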
The lineage graph that it produces — I've just copied the figure from the paper — looks like this. This graph is all that the program is producing until the final collect. You can see that it's a sequence of processing stages: we read the file to produce links, and then, completely separately, we produce the initial ranks, and then there are repeated join and reduceByKey pairs; each of those pairs is one loop iteration. You can see that the loop appends more and more nodes to the graph. In particular, it is not producing a cyclic graph; all of these graphs are acyclic. Another thing to notice, which you wouldn't have seen in MapReduce, is that this data here — the data that we cached, that we persisted — is used over and over again in every loop iteration, so Spark is going to keep it in memory and consult it multiple times.

All right, so what actually happens during execution? What does the execution look like? Again, the assumption is that the input data starts out pre-partitioned over HDFS: our one input file is already split up into lots of 64-megabyte, or whatever they happen to be, pieces in HDFS. When you actually call collect and start the computation, Spark knows that the input data is already partitioned in HDFS, and it tries to split up the work over the workers in a corresponding way. I don't actually know the details: it might try to run the computation on the same machines that store the HDFS data, or it may just set up a bunch of workers to read each of the HDFS partitions, and again there are likely to be more partitions than workers. So we have the input file, and the very first thing is that each worker reads its part of the input file. If you remember, the next step is a map, where each worker runs a little function that splits each line of input into a from/to link tuple. This is a purely local operation, so it can go on in the same worker: we read the data, and then in the very same worker Spark does that initial map. I'm drawing an arrow here from each worker to itself, so there's no network communication involved; the output of the read can be fed directly to that little map function. In fact, Spark streams the data record by record through these transformations: instead of reading the entire input partition and then running the map on the whole partition, Spark reads the first record, or maybe the first couple of records, and runs each record through as many transformations as it can before going on to read the next little bit from the file.
That's so that it doesn't have to store everything: these files could be very large, and it's much more efficient to process them record by record than to hold an entire input partition in memory.

OK, a student asks: is the first node in each chain the worker holding the HDFS chunks, and the remaining nodes in the chain the nodes in the lineage? Yeah, I'm afraid I've been a little bit confusing here. The way to think of this is that, so far, all of this is happening on individual workers: this is worker one, maybe this is another worker, and each worker is proceeding independently. I'm imagining that they're all running on the same machines that store the different partitions of the HDFS file, though there could be network communication here to get from HDFS to the responsible worker; but after that it's all fast, local operations.

So this is what happens with what the paper calls narrow dependencies, that is, transformations that consider each record of data independently, without ever having to worry about its relationship to other records. By the way, this is already potentially more efficient than MapReduce, because if we have what amount to multiple map phases here, they're just strung together in memory, whereas in MapReduce, if you're not super clever and you run multiple MapReduces, even degenerate map-only MapReduce applications, each stage would read its input from GFS, compute, and write its output back to GFS, and then the next stage would read, compute, and write again. Here we've eliminated that reading and writing; it's not a very deep advantage, but it sure helps enormously for efficiency.

However, not all the transformations are narrow; not all of them just read their input record by record with every record independent of the others. The ones to worry about are the distinct call, which needs to see all records that have a particular key; similarly, groupByKey needs to see all records with a given key; and join also has to move things around, since it takes two inputs and needs to bring together all records from both inputs that have the same key. So there are a bunch of these non-local transformations, which the paper calls wide transformations because they potentially have to look at all partitions of the input. That's a lot like reduce in MapReduce.
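To keep the two categories straight for this particular program, here's a rough labeling (narrow steps pipeline locally; wide ones force a shuffle, subject to the co-partitioning caveats discussed below):

    val split   = lines.map { s => val p = s.split("\\s+"); (p(0), p(1)) }  // narrow
    val deduped = split.distinct()                                          // wide
    val grouped = deduped.groupByKey()                                      // wide (maybe avoidable)
    val joined  = grouped.join(ranks)                                       // wide (maybe avoidable)
    val shares  = joined.values.flatMap { case (t, r) => t.map(u => (u, r / t.size)) } // narrow
    val summed  = shares.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)      // wide, then narrow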
Take distinct as an example. Distinct is also going to run on multiple workers, and distinct works on each key independently, so we can partition the computation by key; but the data currently isn't partitioned by key at all — it isn't really partitioned by anything in particular, just however HDFS happened to store it. So for distinct, we're going to run it on all the workers, partitioned by key, but any one worker needs to see all of the input records with a given key, and those may be spread out over all of the workers of the preceding transformation: each worker is responsible for different keys, but the keys may be spread out over all the workers of the preceding transformation. In fact the workers are typically the same — it's going to be the same workers running the map that run the distinct — but the data needs to be moved between the two transformations to bring all the keys together. So what Spark is actually going to do is take the output of this map, hash each record by its key, and use that, mod the number of workers, to select which worker should see it. The implementation is a lot like your implementation of MapReduce: the very last thing that happens in the last of the narrow stages is that the output gets chopped up into buckets corresponding to the different workers of the next transformation, and left waiting for them to fetch it. So the scheme is that each of the workers runs as many of the narrow stages as it can to completion and stores the output split up into buckets; when all of those are finished, we can start running the workers for the distinct transformation, whose first step is to go and fetch, from every other worker, the relevant bucket of the output of the last narrow stage. Then they can run the distinct, because all the records with a given key are on the same worker, and they can start producing output themselves.

Now, of course, these wide transformations are quite expensive. The narrow transformations are super efficient, because we're just taking each record and running a bunch of functions on it totally locally; the wide transformations require pushing a lot of data — in fact essentially all of the data. For PageRank, if you have terabytes of input data, it's still basically the same amount of data at this stage, because it's all the links in the web, so now we're pushing terabytes and terabytes of data over the network to implement this shuffle from the output of the map functions to the input of the distinct. So these wide transformations are pretty heavyweight — a lot of communication — and they're also a kind of computation barrier, because we have to wait for all the narrow processing to finish before we can go on to the wide transformation.

That said, there are some optimizations that are possible because Spark creates the entire lineage graph before it starts any of the data processing, so Spark can inspect the lineage graph and look for opportunities to optimize. Running a sequence of narrow stages all on the same machine, as basically sequential function calls on each input record, is definitely an optimization you can only make if you see the entire lineage graph at once. Another optimization Spark does is noticing when the data has already been partitioned, by an earlier wide shuffle, in the way that the next wide transformation is going to need it. In our original program we have two wide transformations in a row: distinct requires a shuffle, but groupByKey also brings together all the records with a given key, replacing them with, for every key, the list of links starting at that URL. These are both wide operators, both grouping by key, so maybe we have to do a shuffle for the distinct, but Spark can cleverly recognize that the data is then already shuffled in a way that's appropriate for groupByKey, and we don't have to do another shuffle. So even though groupByKey could in principle be a wide transformation, in this case I suspect Spark implements it without communication, because the data is already partitioned by key.
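One way to poke at this from spark-shell is to look at an RDD's partitioner field, which is what Spark consults when deciding whether another shuffle is needed; what it actually prints for each of these RDDs is worth checking rather than trusting my guess:

    // An RDD that carries a partitioner can feed a join or groupByKey on the
    // same keys without being shuffled again.
    println(links2.partitioner)   // output of distinct
    println(links3.partitioner)   // output of groupByKey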
So maybe the groupByKey can, in this particular case, be done without shuffling data, without that expense. Of course, Spark can only do this because it produces the entire lineage graph first and only then runs the computation, so it gets a chance to examine, optimize, and maybe transform the graph.

So that's that topic. Any questions about lineage graphs or how things are executed? Feel free to interrupt.

The next thing I want to talk about is fault tolerance. For these kinds of computations, the fault tolerance we're looking for is not the sort of absolute fault tolerance you'd want with a database, where you really cannot ever afford to lose anything. Here the fault tolerance we're looking for is more like: well, it's expensive if we have to repeat the computation; we could totally repeat it if we had to, but it would take a couple of hours, and that's irritating but not the end of the world. So we're looking to tolerate common failures, but we certainly don't need bulletproof ability to tolerate every possible error. For example, Spark doesn't replicate the driver machine; if the driver, which is controlling the computation and knows about the lineage graph, crashes, I think you have to rerun the whole thing. But any one machine only crashes maybe every few months, so that's no big deal. Another thing to notice is that HDFS is sort of a separate thing: Spark just assumes the input is replicated in a fault-tolerant way on HDFS, and indeed, just like GFS, HDFS keeps multiple copies of the data on multiple servers, so if one of them crashes it can soldier on with the other copy. So the input data is assumed to be relatively fault tolerant, and what that means, at the highest level, is that Spark's strategy, if one of the workers fails, is just to recompute whatever that worker was responsible for: to repeat the computations that were lost with the worker on some other worker, on some other machine. That's basically what's going on.
Now, that recomputation might take a while if you have a long lineage, as you would actually get with PageRank, because PageRank with many iterations produces a very long lineage graph. One way Spark makes it not so bad is that each worker is actually responsible for multiple partitions of the input, so Spark can give each remaining worker just one of the failed worker's partitions, and the remaining workers can basically parallelize the recomputation of what was lost, each handling one of the failed worker's partitions. So if all else fails, Spark just goes back to the beginning, to the input, and recomputes everything that was running on that machine. For narrow dependencies, that's pretty much the end of the story.

However, there actually is a problem with wide dependencies that makes that story not as attractive as you might hope. So the topic here is the failure of one node, one failed worker, in a lineage graph that has wide dependencies. A reasonable sample graph: maybe you have a dependency graph that starts with some narrow dependencies, but then after a while you have a wide dependency — transformations that depend on all the preceding transformations — and then some more narrow ones. The situation is that a single worker has failed, and we need to reconstruct what it held, before we've gotten to the final action and produced the output; we need to recompute what was on the failed worker. The damaging thing here is that, ordinarily, as Spark executes along, it executes each of the transformations and gives its output to the next transformation, but doesn't hold on to that output, unless you happen to tell it to — like the links data, which is persisted with that cache call. In general that data is not held on to, because if you have something like the PageRank lineage graph, maybe dozens or hundreds of steps long, you don't want to hold on to all that data; it's way too much to fit in memory. So as Spark moves through these transformations, it discards the data associated with earlier transformations. That means that when we get here and this worker fails, we need to restart its computation on a different worker. It can re-read the input and redo the original narrow transformations, which just depend on the input we re-read; but when we get to this wide transformation, we have the problem that it requires input not just from the same partition on the same worker, but also from every other partition, on these other workers. Those workers, which are still alive, have in this example proceeded past this transformation and therefore discarded its output, since it may have been a while ago, and so the input that our recomputation needs from all the other partitions doesn't exist anymore.
If we're not careful, that means that in order to rebuild the computation on the failed worker, we may in fact have to re-execute this part on every other worker, as well as the entire lineage graph on the failed worker. That could be very damaging: if I've been running this giant Spark job for a day and then one of a thousand machines fails, it may mean, if we don't know anything cleverer than this, that we have to go back to the very beginning on every one of the workers and recompute the whole thing from scratch. It's the same amount of work; it's going to take another day to recompute a day's computation. That would be unacceptable. We'd really like it so that if one worker out of a thousand crashes, we have to do relatively little work to recover.

Because of that, Spark allows you to make periodic checkpoints of specific transformations. In this graph, what we would do in the Scala program is call — I think it's the persist call, with a special argument — that says: after you compute the output of this transformation, please save the output to HDFS. Then, if something fails, Spark knows that the output of the preceding transformation was saved to HDFS, so we just have to read it from HDFS instead of recomputing it, for all partitions, back to the beginning of time. And because HDFS is a separate storage system that is itself replicated and fault tolerant, the fact that one worker fails doesn't matter: HDFS will still be available. For our PageRank example, I think the traditional thing would be to tell Spark to checkpoint ranks. You can even tell it to checkpoint only periodically: if you're going to run this thing for 100 iterations, it takes a fair amount of time to save the entire ranks to HDFS — again, we're talking about terabytes of data in total — so maybe we tell Spark to checkpoint ranks to HDFS only every 10th iteration or something, to limit the expense. It's a trade-off between the expense of repeatedly saving stuff to disk and how much you'd have to go back and redo if a worker failed.
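The paper describes this as persist with a replication flag; in current Spark the closest mechanism I know of is RDD.checkpoint() plus a checkpoint directory, so a sketch of the every-10th-iteration idea might look like this (the directory path and the interval are illustrative):

    sc.setCheckpointDir("checkpoints")      // would be an HDFS path on a real cluster
    for (i <- 1 to 100) {
      val contribs = links4.join(ranks).values.flatMap { case (t, r) =>
        t.map(u => (u, r / t.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      if (i % 10 == 0) ranks.checkpoint()   // materialized when the next action runs
    }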
There's a question: when we call cache, does that act as a checkpoint? OK, this is a very good question, which I don't know the answer to. The observation is that we could call cache here, and we do call cache here, and the usual use of cache is just to save data in memory with the intent to reuse it; that's certainly why it's being called here, because we're reusing links4. But in my example it would also have the effect of making the output of this stage available in memory — not in HDFS, but in the memory of these workers. The paper never talks about this possibility, and I'm not really sure what's going on. Maybe that would work; or maybe the fact that cache requests are merely advisory, and the data may be evicted if the workers run out of space, means that calling cache isn't a reliable directive to make sure the data really is available — it's more like, well, it'll probably be available on most nodes, but not all, and remember, if even a single node loses its data we're going to have to do a bunch of recomputation. So I'm guessing that persist with replication is a firm directive that guarantees the data will be available even if there's a failure, but I don't really know; it's a good question.

All right, so that's the programming model, the execution model, and the failure strategy. And by the way, just to beat on the failure strategy a little bit more: the way these systems do failure recovery is not a minor thing. As people build bigger and bigger clusters, with thousands and thousands of machines, the probability that a job will be interrupted by at least one worker failure really does start to approach one, and so recent designs intended to run on big clusters have, to a great extent, been dominated by the failure recovery strategy. That's a lot of the explanation, for example, for why Spark insists that the transformations be deterministic and why its RDDs are immutable: that's what allows it to recover from a failure by simply recomputing one partition, instead of having to start the entire computation from scratch. There have been plenty of proposed cluster big-data execution models in the past in which there really was mutable data and in which computations could be non-deterministic — if you look up distributed shared memory systems, those all support mutable data and non-deterministic execution — but because of that, they tend not to have a good failure strategy. Thirty years ago, when a big cluster was four computers, none of this mattered, because the failure probability was very low, and so many different kinds of computation models seemed reasonable; but as clusters have grown to hundreds and thousands of workers, really the only models that have survived are the ones for which you can devise a very efficient failure recovery strategy that does not require backing all the way up to the beginning and restarting. The paper talks about this a little when it criticizes distributed shared memory, and it's a very valid criticism; it's a big design constraint.

OK, so Spark is not perfect for all kinds of processing. It's really geared up for batch processing of giant amounts of data, bulk data processing. If you have terabytes of data and you want to chew away on it for a couple of hours, Spark is great. If you're running a bank and you need to process bank transfers or people's balance queries, then Spark is just not relevant to that kind of processing, nor to typical websites: if I log into Amazon and I want to order some paper towels and put them into my shopping cart,
Spark is not going to help you maintain that shopping cart. Spark may be useful for analyzing your customers' buying habits offline, but not for that sort of online processing. The other situation, a little closer to home, that the Spark of this paper is not so great at is stream processing. Spark definitely assumes that all the input is already available, but in many situations the input people have is really a stream: they're logging all the user clicks on their websites and they want to analyze them to understand user behavior. It's not a fixed amount of data, it's really a stream of input data, and Spark as described in the paper doesn't really have anything to say about processing streams of data. But this turned out to be quite close to home for people who like to use Spark, and now there's a variant of Spark called Spark Streaming that is a little more geared up for processing data as it arrives: it breaks the stream up into smaller batches and runs a batch at a time through Spark. So Spark is good for a lot of batch stuff, but that's certainly not everything.

All right, to wrap up: you should view Spark as a kind of evolution after MapReduce that fixes some expressivity and performance problems that MapReduce has. A lot of what Spark is doing is making the data flow graph explicit: it wants you to think of computations in the style of Figure 3, as entire lineage graphs — stages of computation and the data moving between those stages. It does optimizations on this graph, and failure recovery is very much thinking about the lineage graph as well. So it's really part of a larger move in big data processing toward explicitly thinking about data flow graphs as a way to describe computations. A lot of the specific wins in Spark have to do with performance; some of them are straightforward, but nevertheless important. Some of the performance comes from leaving data in memory between transformations, rather than writing it to GFS and reading it back at the beginning of the next transformation, which you essentially have to do with MapReduce. The other is the ability to define these datasets, these RDDs, and tell Spark to leave an RDD in memory because you're going to reuse it in subsequent stages, and it's cheaper to reuse it than to recompute it; that's easy in Spark and hard to get at in MapReduce. The result is a system that's extremely successful and extremely widely used, and it deserves its success. OK, that's all I have to say, and I'm happy to take questions if anyone has them.