Transcript

All right, let's get started. This is 6.824, Distributed Systems. I'd like to start with a brief explanation of what I think a distributed system is. The core of it is a set of cooperating computers that communicate with each other over a network to get some coherent task done. The kinds of examples we'll focus on in this class are things like storage for big websites, big-data computations such as MapReduce, and somewhat more exotic things like peer-to-peer file sharing. Those are just examples of the kinds of case studies we'll look at. The reason all this is important is that a lot of critical infrastructure out there is built out of distributed systems: infrastructure that requires more than one computer to get its job done, or that inherently needs to be spread out physically.

So, the reasons why people build this stuff. First of all, before I even talk about distributed systems, let me remind you that if you're designing a system to solve some problem and you can possibly solve it on a single computer, without building a distributed system, you should do it that way. There are many, many jobs you can get done on a single computer, and it's always easier. You should try everything else before you try building a distributed system, because distributed systems are not simpler.

The reasons people are driven to use lots of cooperating computers are these. They need high performance, and the way to think about that is that they want to achieve some sort of parallelism: lots of CPUs, lots of memories, lots of disk arms moving in parallel. Another reason people build this stuff is to tolerate faults: have two computers do the exact same thing, and if one of them fails you can cut over to the other one. Another is that some problems are just naturally spread out in space. Say you want to do interbank transfers of money: bank A has a computer in New York City and bank B has a computer in London, and you simply have to have some way for them to talk to each other and cooperate in order to carry that out. So there are systems that are inherently physically distributed. The final reason people build this stuff is to achieve some sort of security goal. Often there's some code you don't trust, or you need to interact with somebody who may be malicious, or maybe their code has bugs in it, so you don't want to have to trust it. You may want to split up the computation so that your stuff runs over there on that computer, my stuff runs here on this computer, and they only talk to each other through some narrowly defined network protocol. If we're worried about security, that isolation is achieved by splitting things up onto multiple computers.
Most of this course is going to be about performance and fault tolerance, although the other two concerns often work their way in as constraints on the case studies we'll look at.

All these distributed systems are hard, and the problems come from a few places. Because they have many parts, and the parts execute concurrently on multiple computers, you get all the problems that come up with concurrent programming: complex interactions and weird timing-dependent behavior. That's part of what makes distributed systems hard. Another thing that makes them hard is that, because you have multiple pieces plus a network, you can have very unexpected failure patterns. If you have a single computer, it's usually the case that either the computer works, or it crashes or suffers a power failure; it pretty much either works or it doesn't. In a distributed system made up of lots of computers you can have partial failures: some pieces stop working while other pieces continue working, or maybe the computers are all working but some part of the network is broken or unreliable. Partial failure is another reason distributed systems are hard. A final reason is that the original motivation for building a distributed system is often to get higher performance, a thousand computers' worth of performance or a thousand disk arms' worth of performance, but it's actually very tricky to obtain that thousand-times speedup with a thousand computers. There are often a lot of roadblocks thrown in your way, so it takes a fair bit of careful design to make the system actually give you the performance you feel you deserve. Solving these problems, of course, is going to be all about addressing these issues.

A reason to take the course is that the problems and the solutions are often quite technically interesting. They're hard problems; for some of them there are pretty good solutions known, and for others there are not such great solutions. Distributed systems are also used by a lot of real-world systems out there: big websites often involve vast numbers of computers put together as distributed systems. When I first started teaching this course, distributed systems were something of an academic curiosity. People thought they were used sometimes, at a small scale, and felt that someday they might be important. But now, particularly driven by the rise of giant websites with vast amounts of data and entire warehouses full of computers, distributed systems have in the last twenty years become a very seriously important part of computing infrastructure. That means a lot of attention has been paid to them and a lot of problems have been solved, but there are still quite a few unsolved problems. So if you're a graduate student, or you're interested in research, there are a lot of problems yet to be solved in distributed systems that you could look into as research.
Finally, if you like building stuff, this is a good class, because it has a lab sequence in which you'll construct some fairly realistic distributed systems, focused on performance and fault tolerance, so you'll get a lot of practice building distributed systems and making them work.

All right, let me talk about course structure a bit before I get started on real technical content. You should be able to find the course website using Google. On the course website are the lab assignments and the course schedule, and also a link to a Piazza page where you can post questions and get answers. As for course staff: I'm Robert Morris, and I'll be giving the lectures. I also have four TAs (you guys want to stand up and show your faces?). The TAs are experts at, in particular, solving the labs. They'll be holding office hours, so if you have questions about the labs you should go to office hours, or you can post questions to Piazza.

The course has a couple of important components. One is the lectures. There's a paper for almost every lecture. There are two exams. There are the programming labs. And there's an optional final project that you can do instead of one of the labs.

The lectures will be about the big ideas in distributed systems, and there will also be a couple of lectures that are more about lab programming. A lot of the lectures will be taken up by case studies: a lot of the way I try to bring out the content of distributed systems is by looking at papers, some academic, some written by people in industry, describing real solutions to real problems. The lectures will be videotaped and I'm hoping to post them online, so if you're not here, or you want to review the lectures, you'll be able to look at the videotaped lectures.

The papers: again, there's one to read per week, most of them research papers. Some are classic papers, like today's paper on MapReduce, which I hope some of you have read. It's an old paper, but it was the beginning of, it spurred, an enormous amount of interesting work, both academic and in the real world. So some are classics, and some are more recent papers about more up-to-date research and what people are currently worried about. From the papers we'll be hoping to tease out what the basic problems are and what ideas people have had that might or might not be useful in solving distributed-system problems. We'll sometimes look at implementation details in these papers, because a lot of this has to do with the actual construction of software-based systems, and we're also going to spend a certain amount of time looking at evaluations: people evaluating how fault tolerant their systems are by measuring them, or measuring how much performance improvement they got, or whether they got any improvement at all.

I'm hoping that you'll read the papers before coming to class. The lectures are maybe not going to make as much sense if you haven't already read the paper, because there's not enough time to both explain all the content of the paper and have an interesting in-class reflection on what the paper means.
So you really have to read the papers before you come to class, and hopefully one of the things you'll learn in this class is how to read a paper rapidly and efficiently: to skip over the parts that maybe aren't that important and focus on teasing out the important ideas.

On the website, linked from the schedule, there's a question for every paper that you should submit an answer to. I think the answers are due at midnight. We also ask that you submit a question you have about the paper through the website, both to give me something to think about as I'm preparing the lecture and, if I have time, so that I can try to answer at least a few of the questions by email. The question and the answer for each paper are due at midnight the night before.

There are two exams: a midterm exam in class, I think on the last class meeting before spring break, and a final exam during final-exam week at the end of the semester. The exams are going to focus mostly on the papers and the labs. Probably the best way to prepare for them, besides attending lecture and reading the papers, is to look at old exams. We have links to twenty years of old exams and solutions, so you can look at those and get a feel for the kinds of questions I like to ask. Indeed, because we read many of the same papers, I inevitably ask questions each year that can't help but resemble questions asked in previous years.

The labs: there are four programming labs, and the first one is due Friday next week. Lab 1 is a simple MapReduce lab, in which you implement your own version of the paper we read for today, which I'll be discussing in a few minutes. Lab 2 involves using a technique called Raft in order to get fault tolerance: in theory it allows any system to be made fault tolerant by replicating it, with Raft managing the replication and managing automatic cut-over if one of the replicated servers fails. So that's Raft for fault tolerance. In lab 3 you'll use your Raft implementation to build a fault-tolerant key/value server; it'll be replicated and fault tolerant. In lab 4 you'll take your replicated key/value server, clone it into a number of independent groups, and split the data in your key/value storage system across all of these individual replicated groups, to get parallel speed-up by running multiple replicated groups in parallel. You'll also be responsible for moving the various chunks of data between different servers as they come and go, without dropping any balls. This is what's often called a sharded key/value service; sharding refers to partitioning the data among multiple servers in order to get parallel speed-up.
If you want, instead of doing lab 4, you can do a project of your own choice. The idea is that if you have some idea for a distributed system, in the style of the systems we talk about in class, an idea of your own that you want to pursue, and you'd like to build something and measure whether it worked in order to explore that idea, you can do a project. For a project you'll pick some teammates, because we require that projects be done in teams of two or three people. So find some teammates, send your project idea to us, and we'll think about it and say yes or no and maybe give you some advice. If we say yes and you want to do a project, you do it instead of lab 4, and it's due at the end of the semester. You should do some design work and build a real system, and on the last day of class you'll demonstrate your system, as well as handing in a short written report about what you built. I've posted on the website some ideas which may or may not be useful to spur thoughts about what you might build, but really the best projects are ones where you have a good idea of your own. If you do a project, you should choose an idea that's in the same vein as the systems we talk about in this class.

Okay, back to the labs. For lab grades, you hand in your lab code, we run some tests against it, and you're graded based on how many tests you pass. We give you all the tests that we use; there are no hidden tests. So if you implement the lab and it reliably passes all the tests, then, unless there's something funny going on (which there sometimes is), chances are good that if your code passes all the tests when you run it, it'll pass all the tests when we run it, and you'll get a full score. Hopefully there will be no mystery about what score you're likely to get on the labs. Let me warn you that debugging these labs can be time-consuming, because they're distributed systems, and with a lot of concurrency and communication, strange, difficult-to-debug errors can crop up. So you really ought to start the labs early; you'll have a lot of trouble if you leave the labs to the last moment. Start early, and if you have problems, please come to the TAs' office hours and feel free to ask questions about the labs on Piazza. Indeed, if you know the answer, I hope you'll answer other people's questions on Piazza as well.

All right, any questions about the mechanics of the course? Yes: the question is how the different components factor into the grade. I forget, but it's all on the website somewhere. The labs are the single most important component.

Okay, so this is a course about infrastructure for applications, and all through this course there's going to be a split in the way I talk about things, between applications, which somebody else, the customer, writes, and the infrastructure that those applications use, which is what we're thinking about in this course. The kinds of infrastructure that tend to come up a lot are storage, communication, and computation.
We'll talk about systems that provide all three of these kinds of infrastructure. It turns out that storage is the one we'll focus on most, because it's a very well-defined, useful, and usually fairly straightforward abstraction, so people know a lot about how to use and build storage systems, and how to build replicated, fault-tolerant, high-performance distributed implementations of storage. We'll also talk about some computation systems; MapReduce, for today, is a computation system. And we will talk about communication some, but mostly as a tool we need in order to build distributed systems: computers have to talk to each other over a network, and maybe you need reliability or something, so we'll talk a bit about that, but we're mostly consumers of communication. If you want to learn how communication systems themselves work, that's more the topic of 6.829.

For storage and computation, a lot of our goal is to discover abstractions whose use simplifies the interface to distributed storage and computation infrastructure, so that it's easy to build applications on top of it. What that really means is that we'd like to build abstractions that hide the distributed nature of these systems. The dream, which is rarely fully achieved, would be to build an interface that looks to an application as if it's a non-distributed storage system, just like a file system or something everybody already knows how to program and that has pretty simple semantics. We'd love to build interfaces that look and act just like non-distributed storage and computation systems but are actually vast, extremely high-performance, fault-tolerant distributed systems underneath. As you'll see as the course goes on, we're only part of the way there: it's rare to find an abstraction for a distributed version of storage or computation that has simple behavior, just like the non-distributed version everybody understands. But people are getting better at this, and we're going to study what people have learned about building such abstractions.

Okay, so what kinds of topics show up as we consider these abstractions? The first general topic we'll see in a lot of the systems we look at has to do with implementation. For example, one of the tools you see a lot, which people have learned to use to build these systems, is remote procedure call (RPC), whose goal is to mask the fact that we're communicating over an unreliable network. Another implementation topic we'll see a lot is threads. Threads are a programming technique that lets us harness multi-core computers, but, maybe more important for this class, threads are a way of structuring concurrent operations that hopefully simplifies the programmer's view of those concurrent operations. And because we're going to use threads a lot, it turns out we're also going to need to spend a certain amount of time, at the implementation level, thinking about concurrency control: things like locks. The main place these implementation ideas will come up in the class is the labs. They'll be touched on in many of the papers, but you're going to come face to face with all of this in a big way when you do the programming for a distributed system in the labs. Beyond ordinary programming, these are some of the critical tools you'll need to build distributed systems.
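To make the threads-plus-locks idea concrete, here's a minimal sketch in Go (the labs are written in Go) of the kind of concurrency-control pattern that shows up constantly: several threads (goroutines) updating shared state, with a mutex making the updates safe. The names and the shared counter here are made up purely for illustration; they're not from the paper or the labs.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Shared state, updated concurrently by many goroutines.
	counts := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	words := []string{"a", "b", "a", "c", "b", "a"}
	for _, w := range words {
		wg.Add(1)
		go func(word string) {
			defer wg.Done()
			// Without the lock, concurrent map writes would race
			// (and in Go, crash). The mutex serializes access.
			mu.Lock()
			counts[word]++
			mu.Unlock()
		}(w)
	}

	wg.Wait() // wait for all goroutines to finish
	fmt.Println(counts)
}
```

In a lab, the body of each goroutine would more likely be an RPC to another server, with the mutex protecting whatever local state records the replies, but the pattern is the same.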
Another big topic that comes up in all the papers we're going to talk about is performance. Usually the high-level goal of building a distributed system is to get what people call scalable speed-up, so we're looking for scalability. What I mean by scalability, or scalable speed-up, is this: if I have some problem that I'm solving with one computer, and I buy a second computer to help, and I can now solve the problem in half the time, or solve twice as many problem instances per minute on two computers as I did on one, that's scalability. Two times the computers, or resources, gets me two times the performance or throughput.

This is a huge hammer, if you can build a system that actually has this behavior: that if you increase the number of computers you throw at the problem by some factor, you get that factor more throughput, more performance, out of the system. It's a huge win, because you can buy computers with just money. The alternative is that, in order to get more performance, you have to pay programmers to restructure your software, to make it more efficient, or to apply specialized techniques or better algorithms. If you have to pay programmers to fix your code to be faster, that's an expensive way to go. We'd love to be able to just buy a thousand computers instead of ten and get a hundred times more throughput; that's fantastic. So this scalability idea is a huge idea in the back of people's heads when they're building things like big websites that run on a building full of computers: the building full of computers is there to get a corresponding amount of performance. But you have to be careful about the design in order to actually get that performance.
Often, the way this looks when I'm drawing diagrams in this course is the following. Suppose we're building a website. Ordinarily you might have a website with an HTTP server: some users, many web browsers, talk to a web server running Python or PHP or whatever, and the web server talks to some kind of database. When you have one or two users, you can just have one computer running both, or maybe one computer for the web server and one for the database. But maybe all of a sudden you get really popular and a hundred million people sign up for your service. How do you fix that? You certainly can't support millions of people on a single computer, except by extremely careful, labor-intensive optimization that you don't have time for. So typically the first thing you do to speed things up is buy more web servers and split the users: half the users, or some fraction of them, go to web server 1, and the other half go to web server 2. And because maybe you're building, I don't know, Reddit or something, where all the users need to see the same data, ultimately all the web servers talk to the same back end, and you can keep adding web servers for a long time. This is a way of getting parallel speed-up on the web-server code (if you're running PHP or Python, maybe it's not too efficient), and as long as each individual web server doesn't put too much load on the database, you can add a lot of web servers before you run into problems.

But this kind of scalability is rarely infinite, unfortunately, certainly not without serious thought. What tends to happen with these systems is that at some point, after you have ten or twenty or a hundred web servers all talking to the same database, the database starts to be the bottleneck, and adding more web servers no longer helps. So it's rare that you get full scalability out to infinite numbers of computers; at some point you run out of gas, because the place where you're adding computers is no longer the bottleneck. By adding lots and lots of web servers we've basically moved the bottleneck that limits performance from the web servers to the database. At that point you almost certainly have to do a bit of design work, because it's rare that there's any straightforward way to take the data in a single database and refactor it so it's split over multiple databases. It's often a fair amount of work, and it's awkward, but many people actually need to do it. We're going to see a lot of examples in this course where the distributed system people are describing is a storage system, because the authors were running something like a big website that ran out of gas on a single database or storage server. Anyway, the scalability story is that we'd love to build systems that scale this way, but it's hard: it takes work and design to push this idea very far.
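As an aside, the "split the users (or the data) across servers" step is usually just hashing. Here's a tiny, hypothetical sketch in Go of picking which web server, or which database shard, handles a given user; the two-server picture above corresponds to nServers = 2. The function name and setup are invented for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickServer maps a user (or a key) to one of nServers by hashing,
// so load spreads roughly evenly and requests for the same user
// consistently land on the same server.
func pickServer(userID string, nServers int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % uint32(nServers))
}

func main() {
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> web server %d\n", u, pickServer(u, 2))
	}
}
```

Lab 4's sharded key/value service applies the same idea to keys rather than users, plus the harder part: moving shards between replica groups as servers come and go.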
Okay, another big topic that comes up a lot is fault tolerance. If you're building a system with a single computer in it, well, a single computer can often stay up for years. I have servers in my office that have been up for years without crashing: the computer is pretty reliable, the operating system is reliable, and apparently the power in my building is pretty reliable, so it's not uncommon for single computers to stay up for amazing amounts of time. However, if you're building systems out of thousands of computers, then even if each computer can be expected to stay up for a year, a thousand computers means roughly a thousand failures per year, which works out to about three computer failures per day across your set of a thousand machines. So solving big problems with big distributed systems turns very rare failure problems into failure problems that happen all the time. In a system with a thousand computers there's almost certainly always something broken: some computer that's crashed, or is mysteriously running incorrectly or slowly or doing the wrong thing, or maybe some piece of the network is broken. With a thousand computers you have a lot of network cables and a lot of network switches, so there's always some cable somebody stepped on that's now unreliable, or a cable that fell out, or a switch whose fan broke so the switch overheated and failed. There's always some little problem somewhere in your building-sized distributed system.

So big scale turns failures from very rare events you really don't have to worry about much into constant problems. That means the response to failure, the masking of failures, the ability to proceed despite failures, has to be built into the design, because there are always failures. And as part of building convenient abstractions for application programmers, we really need to be able to build infrastructure that, as much as possible, hides or masks the failures from application programmers, so that every application programmer doesn't have to have a complete, complicated story for all the different kinds of failures that can occur.

There are a bunch of different notions of what it means to be fault tolerant, and we'll see a lot of different flavors, but here are some of the more common ideas. One is availability. Some systems are designed so that under certain kinds of failures (not all failures, but certain kinds) the system will keep operating despite the failure, while providing undamaged service, the same kind of service it would have provided had there been no failure. Some systems are available in that sense. For example, if you build a replicated service with two copies and one of the replica servers fails, maybe the other server can continue operating. If they both fail, of course, you can't promise availability. So available systems usually say: under a certain set of failures we're going to continue providing service, and if more failures than that occur, the system won't be available anymore.
Another kind of fault tolerance you might have, in addition to availability or by itself, is recoverability. What this means is that if something goes wrong, maybe the service will stop working: it'll simply stop responding to requests and wait for someone to come along and repair whatever went wrong, but after the repair the system will be able to continue as if nothing bad had happened. This is a weaker requirement than availability, because here we're not going to do anything until the failed component has been repaired. But the fact that we can get going again without any loss of correctness is still a significant requirement: it means recoverable systems typically need to do things like save their latest state on disk, somewhere they can get it back after the power comes back up. And even among available systems, in order for a system to be useful in real life, the way available systems are usually specified is that they're available until some number of failures have happened; if too many failures happen, an available system will stop working, or stop responding at all, but when enough things have been repaired it will continue operating. So a good available system will be recoverable as well, in the sense that if too many failures occur it will stop answering, but will then continue correctly after repair. That's what we'd love to obtain.

We'll see a number of approaches to solving these problems, but there are really two tools that are the most important ones we have in this department. One is non-volatile storage: if something crashes, or the power fails, maybe there's a building-wide power failure, we can use non-volatile storage, like hard drives or flash or solid-state drives, to store a checkpoint or a log of the state of the system, and then when the power comes back up, or somebody repairs our power supply or whatever, we'll be able to read our latest state off the disk and continue from there. The management of non-volatile storage comes up a lot, because non-volatile storage tends to be expensive to update, and a huge amount of the nitty-gritty of building high-performance fault-tolerant systems is in clever ways to avoid having to write non-volatile storage too often. In the old days, and even today, writing non-volatile storage meant moving a disk arm and waiting for a disk platter to rotate, both of which are agonizingly slow on the scale of three-gigahertz microprocessors. With things like flash, life is quite a bit better, but it still requires a lot of thought to get good performance out of.
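To make "store a log of the state and read it back after a failure" concrete, here's a minimal, hypothetical sketch in Go of the basic pattern: append each update to a log file and force it to disk before acknowledging, then on restart replay the log to rebuild the in-memory state. Real systems add checkpoints, record framing, and careful error handling; the file name and function names below are invented, and this only shows the core idea.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// apply performs one logged update on the in-memory state.
func apply(state map[string]string, key, value string) {
	state[key] = value
}

// logUpdate appends one update to the log and forces it to disk
// (fsync) before returning, so the update survives a crash.
func logUpdate(logFile *os.File, key, value string) error {
	if _, err := fmt.Fprintf(logFile, "%s %s\n", key, value); err != nil {
		return err
	}
	return logFile.Sync()
}

// recoverState rebuilds the state by replaying every record in the log.
func recoverState(path string) (map[string]string, error) {
	state := make(map[string]string)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return state, nil // nothing logged yet
	} else if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), " ", 2)
		if len(parts) == 2 {
			apply(state, parts[0], parts[1])
		}
	}
	return state, sc.Err()
}

func main() {
	const path = "updates.log"
	state, _ := recoverState(path) // after a restart, replay the log
	logFile, _ := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	defer logFile.Close()

	apply(state, "x", "1")
	logUpdate(logFile, "x", "1") // durable before we reply to the client
	fmt.Println(state)
}
```

The "clever ways to avoid writing non-volatile storage too much" mentioned above mostly amount to batching many updates per Sync and periodically replacing a long log with a compact checkpoint.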
The other big tool we have for fault tolerance is replication, and the management of replicated copies is tricky. The key problem lurking in any replicated system, where we have two servers each with a supposedly identical copy of the system state, is always that the two replicas will accidentally drift out of sync and stop being replicas. That problem is in the back of every design we're going to see for using replication to get fault tolerance, and labs 2 and 3 are all about the management of replicated copies for fault tolerance. As you'll see, it's pretty complex.

A final cross-cutting topic is consistency. Here's an example of what I mean by consistency. Suppose we're building a distributed storage system, and it's a key/value service, so it supports just two operations. There's a put operation: you give it a key and a value, and the storage system stashes away the value as the value for that key; it maintains just a big table of keys and values. And there's a get operation: the client sends it a key, and the storage service is supposed to respond with the value it has stored for that key. I use this as an example because a key/value service is about the simplest distributed system I can think of, and lots of real distributed systems are key/value services. They're very useful: a kind of fundamental, simple version of a storage system.

Of course, if you're an application programmer, it's helpful if these two operations have meanings attached to them, so that you can go look in the manual, and the manual says what it means, and what you'll get back, when you call get, and what it means when you call put. There needs to be some sort of spec for what they mean; otherwise, who knows, how could you possibly write an application without a description of what put and get are supposed to do? This is the topic of consistency, and the reason it's interesting in distributed systems is that, both for performance reasons and for fault-tolerance reasons, we often have more than one copy of the data floating around. In a non-distributed system, where you just have a single server with a single table, there's often (although not always) relatively little ambiguity about what put and get could possibly mean: intuitively, put means update the table, and get means return the version that's stored in the table. But in a distributed system where there's more than one copy of the data, due to replication or caching or who knows what, there may be lots of different versions of a key/value pair floating around.

For example, suppose there are two copies of the server, so they both have a key/value table, and maybe key 1 has the value 20 on both of them.
Then some client comes along, we have a client over here, and it sends a put: it wants to update the value of key 1 to be 21 (maybe it's counting something in this key/value store). So it sends a put with key 1 and value 21 to the first server, and it's about to send the same put to the second server, because it wants to update both copies and keep them in sync. But just before it sends the put to the second server, it crashes: a power failure, or a bug in the operating system, or something. So the state we're sadly left in is that we've updated one of the two replicas to have value 21, but the other one still has 20. Now somebody comes along and reads key 1 with a get, and they might get 21 or they might get 20, depending on which server they talk to. Even if the rule is that you always talk to the top server first, in a fault-tolerant system the actual rule has to be: talk to the top server first unless it has failed, in which case talk to the bottom server. So either way, someday you risk exposing this stale copy of the data to some future get. It could be that many gets see the updated 21, and then next week, all of a sudden, some get yields a week-old copy of the data. That's not very consistent, but it's the kind of thing that can happen if we're not careful. So we need to actually write down what the rules are going to be for puts and gets, given this danger due to replication.

It turns out there are many different definitions of consistency. Many of them are relatively straightforward; many of them sound like "a get yields the value written by the most recently completed put". That's usually called strong consistency. It also turns out to be very useful to build systems with much weaker consistency, which, for example, do not guarantee anything like "a get sees the value written by the most recent put". So there are strongly consistent systems, which usually have some version of "gets see the most recent put", although there are a lot of details to work out; and there are many flavors of weakly consistent systems that make no such guarantee. They may guarantee only that if someone does a put, you might not see it, and you might keep seeing old values that weren't updated by the put, perhaps for an unbounded amount of time. The reason people are very interested in weak consistency schemes is that strong consistency, that is, reads actually being guaranteed to see the most recent write, is a very expensive spec to implement, because it almost certainly means somebody has to do a lot of communication. If you have multiple copies, it means that either the writer or the reader, or maybe both, has to consult every copy.
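Here's a small, hypothetical sketch in Go of the scenario above: two replicas, a client that updates them one at a time (and can crash in between), and two kinds of read, a cheap one that asks a single replica and can return stale data, and an expensive one that asks every replica and takes the most recently written value (using a version number as a stand-in for "most recent"). All of the names and the version-number scheme are made up for illustration; none of this is from the paper.

```go
package main

import "fmt"

// One replica's table: key -> (value, version).
type entry struct {
	value   string
	version int
}
type replica map[string]entry

// put writes to the replicas one at a time; if the client crashes
// partway through (crashAfterFirst), the replicas diverge.
func put(replicas []replica, key, value string, version int, crashAfterFirst bool) {
	for i, r := range replicas {
		r[key] = entry{value, version}
		if crashAfterFirst && i == 0 {
			return // simulated client crash between the two sends
		}
	}
}

// weakGet asks a single replica: cheap, but may return a stale value.
func weakGet(r replica, key string) string {
	return r[key].value
}

// strongishGet asks every replica and returns the highest-versioned
// value: more communication, but it sees the latest completed write.
func strongishGet(replicas []replica, key string) string {
	best := entry{}
	for _, r := range replicas {
		if e := r[key]; e.version >= best.version {
			best = e
		}
	}
	return best.value
}

func main() {
	r1, r2 := replica{}, replica{}
	replicas := []replica{r1, r2}

	put(replicas, "1", "20", 1, false) // both copies hold 20
	put(replicas, "1", "21", 2, true)  // crash: only r1 gets 21

	fmt.Println(weakGet(r2, "1"))            // "20": a stale read
	fmt.Println(strongishGet(replicas, "1")) // "21": reads all copies
}
```

The real difficulty, which this sketch ignores entirely, is doing something like strongishGet when replicas themselves crash and messages are lost; that is what Raft and labs 2 and 3 are about.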
Like in this case, where a client crash left one copy updated but not the other: if we wanted to implement strong consistency in maybe the simplest way in this system, we'd have readers read both of the copies, or all the copies if there are more than two, and use the most recently written value they find. But that's expensive; that's a lot of chitchat just to read one value. So, in order to avoid communication as much as possible, particularly if the replicas are far away, people build weak systems that might actually allow a stale read of an old value in a case like this, although there are often more semantics attached to try to make these weak schemes more useful.

Where this communication problem, strong consistency requiring expensive communication, really runs you into trouble is this: if we're using replication for fault tolerance, then we really want the replicas to have independent, uncorrelated failure probabilities. For example, putting both replicas of our data in the same rack in the same machine room is probably a really bad idea, because if someone trips over the power cable to that rack, both copies of our data die, since they're both attached to the same power cable in the same rack. So in the search for making replicas fail as independently as possible, in order to get decent fault tolerance, people would love to put different replicas as far apart as possible, in different cities, or maybe on opposite sides of the continent, so that an earthquake that destroys one data center is extremely unlikely to also destroy the data center that has the other copy. We'd love to be able to do that. But if you do, then the other copy is thousands of miles away, and the rate at which light travels means it may take on the order of milliseconds, or tens of milliseconds, to communicate with a data center across the continent in order to update the other copy of the data. That makes the communication required for strong consistency potentially extremely expensive: every time you want to do one of these puts, or maybe a get, depending on how you implement it, you might have to sit there waiting ten or twenty or thirty milliseconds in order to talk to both copies of the data, to ensure they're both updated or to check both to find the latest copy. That's a tremendous expense: ten or twenty or thirty milliseconds on machines that, after all, execute something like a billion instructions per second, so we're wasting a lot of potential instructions while we wait. So people often go with much weaker systems, where you're allowed to update only the nearest copy, or consult only the nearest copy, and there's a huge amount of academic and real-world research on how to structure weak consistency guarantees so they're actually useful to applications, and how to take advantage of them to actually get high performance.

All right, so that's a lightning preview of the technical ideas in the course. Any questions about this before I start talking about MapReduce?
All right, I want to switch to MapReduce. It's a detailed case study that will illustrate most of the ideas we've been talking about. MapReduce is a system that was originally designed, built, and used by Google; I think the paper dates back to 2004. The problem they faced was that they were running huge computations on terabytes and terabytes of data, like creating an index of all the content of the web, or analyzing the link structure of the entire web in order to identify the most important or most authoritative pages. The whole web was, even in those days, tens of terabytes of data. Building an index of the web is basically equivalent to running a sort over the entire data set, which is reasonably expensive; to run a sort over the entire content of the web on a single computer, who knows how long it would have taken, but weeks or months or years or something. So Google at the time was desperate to be able to run giant computations on giant data sets, on thousands of computers, so that the computations could finish rapidly. It was worth it to them to buy lots of computers so that their engineers wouldn't have to spend a lot of time reading the newspaper or something while waiting for their big compute jobs to finish.

For a while they had their clever engineers hand-write this stuff. If you needed to write a web indexer or some sort of link-analysis tool, Google bought the computers and said: here, engineers, write whatever software you like on these computers. And the engineers would laboriously write one-off, manually written software to take whatever problem they were working on, somehow farm it out to a lot of computers, organize that computation, and get the data back. If you only hire engineers who are skilled distributed-systems experts, maybe that's okay, although even then it's probably very wasteful of engineering effort. But they wanted to hire people who were skilled at something else, and not necessarily engineers who wanted to spend all their time writing distributed-systems software. So they really needed some kind of framework that would make it easy for their engineers to write just the guts of whatever analysis they wanted to do, the sort algorithm, or the web indexer, or the link analyzer, or whatever, and be able to run it on thousands of computers without worrying about the details of how to spread the work over those computers, how to organize whatever data movement was required, or how to cope with the inevitable failures. They were looking for a framework that would make it easy for non-specialists to write and run giant distributed computations. That's what MapReduce is all about. The idea is that the programmer, the application designer, the consumer of this distributed computation, just writes a simple map function and a simple reduce function that don't know anything about distribution, and the MapReduce framework takes care of everything else.
Here's an abstract view of what MapReduce is up to. It starts by assuming that there's some input, and the input is split up into a whole bunch of different files or chunks in some way: input file 1, input file 2, and so on. These inputs are maybe web pages crawled from the web, or more likely big files, each of which contains many web pages crawled from the web. The way MapReduce starts is that you define a map function, and the MapReduce framework is going to run your map function on each of the input files. You can see there's some obvious parallelism available here: it can run the maps in parallel. Each of these map invocations looks only at its own input and produces output, and the output that a map function is required to produce is a list of key/value pairs. It takes a file, some fraction of the input data, as input, and it produces a list of key/value pairs as output.

For example, suppose we're writing the simplest possible MapReduce application: word count. The goal of a word-count MapReduce job is to count the number of occurrences of each word, so your map function might emit key/value pairs where the key is the word and the value is just 1. The map function splits its input up into words, and for every word it sees it emits that word as the key and 1 as the value; later on we'll count up all those 1s to get the final output. So maybe input 1 has the word a in it and the word b in it, and the output this map produces is key a value 1, key b value 1. Maybe the second map invocation sees a file that has a b in it and nothing else, so it's going to emit b 1. Maybe the third input has an a in it and a c in it. So we run all these maps on all the input files, and we get what the paper calls intermediate output: for every map, a set of key/value pairs.

The second stage of the computation is to run the reduces, and the idea is that the MapReduce framework collects together, from all the maps, all instances of each key. So the framework is going to collect together all of the a's, every key/value pair from every map whose key was a, and hand them all to one call of the programmer-defined reduce function. Then it's going to take all the b's and collect them together; of course this requires a real collection, real data movement, because the different instances of key b were produced by different invocations of map on different computers. So we're going to collect all the b keys and hand them to a different call to reduce that has all of the b's as its argument, and the same for c. The MapReduce framework will arrange for one call to reduce for every key that occurred in any of the map output.
For our silly word-count example, all any one of these reduces has to do is count the number of items passed to it. It doesn't even have to look at the items, because it knows that each of them is the word it's responsible for, plus 1 as the value; it doesn't have to look at those 1s, it can just count. So this reduce is going to produce a and then the count of its inputs, 2, and that reduce is going to produce the key it's associated with, b, and the count of its values, which is also 2. That's what a typical MapReduce job looks like at a high level.

Just for completeness, a little terminology: the whole computation is called a job, and any one invocation of map or reduce is called a task. So we have the entire job, and it's made up of a bunch of map tasks and then a bunch of reduce tasks.

Here's an example, for word count, of what the map and reduce functions would look like. The map function takes a key and a value as arguments, and now we're talking about functions written in an ordinary programming language, like C++ or Java or who knows what; this is just code that ordinary people can write. What a map function for word count does: the key is the file name, which is typically ignored, since we don't really care what the file name was, and v is the content of this map's input file, so v just contains all the text. We're going to split v into words, and then, for each word, we're going to call emit. Emit takes two arguments and is provided by the MapReduce framework: we hand emit a key, which is the word, and a value, which is the string "1". That's it for a word-count map function, and in MapReduce it can literally be this simple; that's sort of the promise of MapReduce. This map function doesn't know anything about distribution, or multiple computers, or the fact that we need to move data across the network, or who knows what. It's extremely straightforward.

And the reduce function for word count: remember, each reduce is called with all the instances of a given key, so the MapReduce framework calls reduce with the key it's responsible for and a vector of all the values that the maps produced associated with that key. Here the key is the word, and the values are all 1s; we don't care about them, we only care how many there were. Reduce has its own emit function, which takes just one argument, a value to be emitted as the final output, as the value for this key. So we're going to emit the length of this array. This is also about as simple as reduce functions get in MapReduce: extremely simple, and requiring no knowledge about fault tolerance or anything else.
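The paper sketches these two functions in C++-style pseudocode. Here is a minimal version in Go (the labs are in Go), together with a tiny sequential driver that plays the role of the framework: splitting the input, calling map, grouping intermediate pairs by key, and calling reduce. The type and function names are my own, for illustration; lab 1 defines its own interfaces.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// KeyValue is one intermediate pair emitted by a map function.
type KeyValue struct {
	Key   string
	Value string
}

// mapF: key is the input file name (ignored here), value is its contents.
// For word count it emits (word, "1") for every word it sees.
func mapF(filename, contents string) []KeyValue {
	var kvs []KeyValue
	for _, w := range strings.Fields(contents) {
		kvs = append(kvs, KeyValue{w, "1"})
	}
	return kvs
}

// reduceF is called once per key with all the values the maps emitted
// for that key; for word count the answer is just how many there were.
func reduceF(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	inputs := map[string]string{ // file name -> contents
		"in1": "a b",
		"in2": "b",
		"in3": "a c",
	}

	// "Map phase": run mapF on every input, collect intermediate pairs.
	intermediate := make(map[string][]string)
	for name, contents := range inputs {
		for _, kv := range mapF(name, contents) {
			intermediate[kv.Key] = append(intermediate[kv.Key], kv.Value)
		}
	}

	// "Reduce phase": one call to reduceF per distinct key.
	for key, values := range intermediate {
		fmt.Println(key, reduceF(key, values)) // a 2, b 2, c 1
	}
}
```

The interesting part of the real system, of course, is everything this little driver hides: running the maps and reduces on different machines and moving the intermediate data between them, which is what the rest of the lecture is about.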
All right, any questions about the basic framework?

Question: can you feed the output of the reducers into another MapReduce? Oh yes, in real life it is routine among MapReduce users to define a MapReduce job that takes some inputs and produces some outputs, and then have a second MapReduce job consume them, because you're doing some complicated multi-stage analysis, or an iterative algorithm like PageRank, for example, which is the algorithm Google uses to estimate how important or influential different web pages are. That's an iterative algorithm that gradually converges on an answer, and if you implement it in MapReduce, which I think they originally did, you have to run the MapReduce job multiple times, and the output of each run is a list of web pages with an updated weight or importance for each page. So it was routine to take that output and use it as the input to another MapReduce job.

Follow-up: yes, you need to set things up so the output is in the right form; you write the reduce function in the knowledge that it needs to produce data in the format, or with the information, required by the next MapReduce job. This actually brings up a bit of a shortcoming in the MapReduce framework. It's great if the algorithm you need to run is easily expressible as a map, followed by this shuffling of the data by key, followed by a reduce, and that's it; MapReduce is fantastic for algorithms that can be cast in that form. Furthermore, each of the maps has to be completely independent: they're required to be pure functions that just look at their arguments and nothing else. That's a restriction, and it turns out many people want to run much longer pipelines involving lots and lots of different kinds of processing. With MapReduce you have to cobble that together from multiple distinct MapReduce jobs. More advanced systems, which we'll talk about later in the course, are much better at allowing you to specify the complete pipeline of computation, and then the framework sees all the work you have to do and can organize and efficiently optimize much more complicated computations.

Another question: from the programmer's point of view, it's just about map and reduce. From our point of view, it's going to be about the worker processes and the worker servers that are part of the MapReduce framework and that, among many other things, call the map and reduce functions. So yes, from our point of view we care a lot about how this is organized by the surrounding framework; what I've drawn is the programmer's view, with all the distributed stuff stripped out.

Another question: sorry, say that again. Oh, you mean where does the intermediate data go? Okay, so there are two questions: one is what happens to the data when you call emit, and the other is where the functions run. The actual answer, starting with where this stuff runs: there are a number of, say, a thousand servers. Actually, the right thing to look at here is Figure 1 in the paper. Sitting underneath this, in the real world, there's some big collection of servers, and we'll call them worker servers, or workers, and there's also a single master server.
I mean this 60:07 actually brings up a little bit of a 60:09 shortcoming in the MapReduce framework 60:11 which is that it's great if the 60:16 algorithm you need to run is easily 60:18 expressible as a map followed by this 60:20 sort of shuffling of the data by key 60:23 followed by a reduce and that's it 60:26 MapReduce is fantastic for algorithms 60:28 that can be cast in that form where 60:30 furthermore each of the maps has to be 60:32 completely independent and 60:33 they're required to be pure 60:39 functional functions that just look at 60:42 their arguments and nothing else 60:44 you know that's a restriction 60:46 and it turns out that many people want 60:48 to run much longer pipelines that 60:49 involve lots and lots of different kinds 60:51 of processing and with MapReduce you 60:53 have to sort of cobble that together 60:54 from multiple distinct 60:58 MapReduce jobs and more advanced systems 61:00 which we will talk about later in the 61:02 course are much better at allowing you 61:04 to specify the complete pipeline of 61:06 computations and they'll do optimization 61:08 you know the framework realizes all the 61:10 stuff you have to do and can organize and 61:13 efficiently optimize 61:15 much more complicated computations 61:39 from the programmer's point of view it's 61:41 just about map and reduce from our point 61:44 of view it's going to be about the 61:45 worker processes and the worker servers 61:49 that are part of the MapReduce 61:53 framework and that among many other things 61:55 call the map and reduce functions so 62:00 yeah from our point of view we care a 62:01 lot about how this is organized by the 62:04 surrounding framework this is sort of 62:06 the programmer's view with all the 62:08 distributed stuff stripped out yes 62:15 sorry you gotta say it again oh you mean 62:25 where does the intermediate data go okay so 62:32 there's two questions one is when you 62:35 call emit what happens to the data and 62:38 the other is where the functions run so 62:46 the actual answer is that first where 62:50 the stuff runs there's a number of say 62:53 a thousand servers um actually the right 62:56 thing to look at here is figure one in 62:58 the paper sitting underneath this in the 63:02 real world there's some big collection 63:04 of servers and we'll call them maybe 63:09 worker servers or workers and there's 63:12 also a single master server that's 63:14 organizing the whole computation and 63:16 what's going on here is the master 63:18 server you know knows that there's some 63:22 number of input files you know five 63:24 thousand input files and it farms out 63:27 invocations of map to the different 63:29 workers so it'll send a message to 63:30 worker seven saying please run you know 63:34 this map function on such-and-such an 63:37 input file and then the worker process 63:41 which is you know part of MapReduce and 63:43 knows all about MapReduce will then 63:47 read the input 63:50 whichever input file and call this map 63:54 function with the file name and value as its 63:56 arguments that worker process also 64:00 implements emit and 64:02 every time the map calls emit the worker 64:05 process will write this data to files on 64:10 the local disk so what happens to map 64:12 emits is that they produce files on the 64:17 map worker's local disk that 64:19 accumulate all the keys and values 64:21 produced by the maps run on that worker 64:26 so at the end of the map phase what 64:30 we're left with is all those worker 64:32 machines each of which has the output of 64:35 whatever maps were run on that 64:37 worker machine then the MapReduce 64:42 workers arrange to move the data to 64:45 where it's going to be needed for the 64:46 reduces so you know in a 64:50 typical big computation 64:53 this reduce invocation is going to need 64:55 all the map output that 64:59 mentioned the key a and it's gonna turn 65:01 out you know this is a simple example 65:04 but in general every single map 65:08 invocation will have produced lots of 65:10 keys including some instances of key a 65:12 so typically before we can even 65:15 run this reduce function the MapReduce 65:17 framework that is the MapReduce worker 65:20 running on one of our thousand servers 65:22 is going to have to go talk to every 65:24 single other one of the thousand servers and 65:26 say look you know I'm gonna run the 65:28 reduce for key a please look at the 65:31 intermediate map output stored on your 65:33 disk and fish out all of the instances 65:35 of key a and send them over the network 65:38 to me so the reduce worker is going to 65:41 do that it's going to fetch from every 65:43 worker all of the instances of the key 65:45 that it's responsible for the key the 65:47 master has told it to be responsible for 65:50 and once it's collected all of that data 65:51 then it can call reduce and the reduce 65:55 function itself calls reduce's emit which 65:58 is different from the map's emit and what 66:01 reduce's emit does is write the output 66:04 to a file in a cluster file service that 66:12 Google uses
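Here is a sketch of the map-worker side of that, in Go, under the assumption (as in both the paper and the lab) that intermediate pairs are partitioned into one local file per reduce task by hashing the key; the doMapTask name and the mr-MAPID-REDUCEID file naming are illustrative choices, not something the paper specifies.

    // Sketch of what a map worker might do with one map task: read the
    // input file, call the application's Map, and bucket the emitted pairs
    // into one intermediate file on the local disk per reduce task.
    package main

    import (
        "encoding/json"
        "fmt"
        "hash/fnv"
        "os"
    )

    type KeyValue struct {
        Key   string
        Value string
    }

    // partition decides which reduce task is responsible for a key.
    func partition(key string, nReduce int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()&0x7fffffff) % nReduce
    }

    func doMapTask(mapID int, filename string, nReduce int,
        mapf func(string, string) []KeyValue) error {
        contents, err := os.ReadFile(filename)
        if err != nil {
            return err
        }
        kva := mapf(filename, string(contents))

        // one intermediate file on local disk per reduce task,
        // e.g. mr-3-7 holds map task 3's output destined for reduce task 7
        encoders := make([]*json.Encoder, nReduce)
        for r := 0; r < nReduce; r++ {
            f, err := os.Create(fmt.Sprintf("mr-%d-%d", mapID, r))
            if err != nil {
                return err
            }
            defer f.Close()
            encoders[r] = json.NewEncoder(f)
        }
        for _, kv := range kva {
            if err := encoders[partition(kv.Key, nReduce)].Encode(&kv); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        // trivial map function just to exercise doMapTask
        mapf := func(filename, contents string) []KeyValue {
            return []KeyValue{{Key: filename, Value: contents}}
        }
        if err := doMapTask(0, "input.txt", 4, mapf); err != nil {
            fmt.Println(err)
        }
    }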
so here's something I 66:14 haven't mentioned I haven't mentioned 66:17 where the input lives and where the 66:21 output lives they're both files and because 66:25 we want the 66:28 flexibility to be able to read any piece 66:31 of input on any worker server that means 66:34 we need some kind of network file system 66:36 to store the input data and so indeed 66:42 the paper talks about this thing called 66:44 GFS or Google File System and GFS is a 66:50 cluster file system and GFS actually 66:51 runs on exactly the same set of 66:54 worker servers that run MapReduce 66:56 and GFS just automatically 67:00 you know it's a file system you 67:02 can read and write files and it just 67:03 automatically splits up any big file you 67:06 store on it across lots of servers in 67:08 64 megabyte chunks so if you 67:12 have ten terabytes of crawled 67:14 web page contents and you just write 67:17 them to GFS even as a single big file 67:20 GFS will automatically split that vast 67:23 amount of data up into 64 megabyte 67:25 chunks distributed evenly over all of 67:28 the GFS servers which is to say all the 67:30 servers that Google has available and 67:32 that's fantastic that's just what we 67:34 need if we then want to run a MapReduce 67:36 job that takes the entire crawled web as 67:39 input the data is already stored in a 67:42 way that's split up evenly across all the 67:44 servers and so that means that the map 67:47 workers you know 67:49 if we have a thousand servers we're 67:51 gonna launch a thousand map workers each 67:53 reading one one-thousandth of the input data and 67:55 they're going to be able to read the 67:57 data in parallel from a thousand GFS 68:01 file servers thus getting tremendous 68:04 total read throughput you know the read 68:07 throughput of a thousand servers
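As a rough back-of-the-envelope check of that, using the 10 terabyte and 64 megabyte figures from the lecture and decimal units:

    \frac{10\ \mathrm{TB}}{64\ \mathrm{MB}} \;=\; \frac{10\times 10^{12}\ \mathrm{B}}{64\times 10^{6}\ \mathrm{B}} \;\approx\; 1.6\times 10^{5}\ \text{chunks}

so spread over a thousand GFS servers, each server ends up holding on the order of 160 chunks, roughly 10 GB of the input.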
68:20 so are you thinking maybe that Google 68:23 has one set of physical machines running 68:25 GFS and a separate set of physical 68:27 machines that run MapReduce jobs okay 68:40 right so the question is what does this 68:44 arrow here actually involve and the 68:48 answer is that it actually sort of changed 68:50 over the years as Google 68:51 evolved this system but you know in 68:55 the most general case if we have 68:58 big files stored in some big network 69:01 file system you know GFS 69:02 is a bit like AFS which you might have used on 69:05 Athena where your data is split over a big 69:09 collection of servers and you have to go talk 69:11 to those servers over the network to 69:12 retrieve your data in that case what 69:14 this arrow might represent is that the 69:17 MapReduce worker process has to go off 69:20 and talk across the network to the 69:22 correct GFS server or maybe servers that 69:25 store its part of the input and fetch 69:28 it over the network to the MapReduce 69:30 worker machine in order to pass it to the map 69:33 and that's certainly the most general 69:35 case and that was eventually how 69:37 MapReduce actually worked in the world 69:40 of this paper though if you did 69:44 that that's a lot of network 69:45 communication we're talking about ten 69:47 terabytes of data and we'd have to move ten 69:49 terabytes across their data center 69:51 network and you know data center 69:54 networks run at gigabits per second but 69:55 it's still a lot of time to move tens of 69:57 terabytes of data and 70:02 indeed in the world of this paper in 70:04 2004 the most constraining bottleneck in 70:07 their MapReduce system was network 70:08 throughput because of the network they were running on 70:11 if you read as far as 70:13 the evaluation section 70:18 their network was like this they had thousands 70:24 of machines 70:27 and they would plug the machines in 70:32 each rack of machines had you know an 70:35 Ethernet switch for that rack or 70:36 something but then you know they all 70:38 need to talk to each other so there was 70:40 a root Ethernet switch that all of the 70:43 racks' Ethernet switches talked to and 70:45 you know so if you just 70:47 pick some MapReduce worker and some GFS 70:51 server you know chances are at least 70:52 half the time the communication between 70:54 them has to pass through this one 70:56 root switch which had 70:58 only some amount of total throughput 71:01 you know some number of 71:05 gigabits per second and I forget the 71:09 number but when I did the division 71:13 that is divided the total 71:17 throughput available in the root switch 71:19 by the roughly 2000 servers that they 71:21 used in the paper's experiments what I 71:23 got was that each machine's share of the 71:26 root switch of the total network 71:27 capacity was only 50 megabits per second 71:30 in their setup 50 megabits 71:36 per second per machine and that might 71:41 seem like a lot 50 megabits gosh 71:43 millions and millions but it's actually 71:45 quite small compared to how fast disks 71:47 run or CPUs run and so with their 71:51 network this 50 megabits per second was 71:53 like a tremendous limit
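Running the same division the other way, and assuming the input is spread perfectly evenly, which is an idealization:

    50\ \mathrm{Mbit/s} \times 2000\ \text{machines} \;\approx\; 100\ \mathrm{Gbit/s}\ \text{(implied root-switch capacity)}
    \frac{10\ \mathrm{TB}/2000\ \text{machines}}{50\ \mathrm{Mbit/s}} \;=\; \frac{5\ \mathrm{GB}}{6.25\ \mathrm{MB/s}} \;=\; 800\ \mathrm{s} \;\approx\; 13\ \text{minutes}

that is, roughly 13 minutes per machine just to pull its share of a 10 TB input through the root switch before doing any computation, whereas a local disk of that era could read the same 5 GB in a minute or two; the 100 Gbit/s figure is only what the quoted 50 Mbit/s per machine implies, not a number taken from the paper.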
and so they 71:56 really stood on their heads in the 71:57 design described in the paper to avoid 72:00 using the network and they played a 72:02 bunch of tricks to avoid sending stuff 72:05 over the network whenever they possibly 72:07 could one of them was that they 72:10 ran the GFS servers and the 72:14 MapReduce workers on the same set of 72:16 machines so if they have a thousand 72:19 machines they implement 72:23 their GFS service on those thousand 72:25 machines and run MapReduce on the same 72:27 thousand machines and then when the 72:29 master was splitting up the map work and 72:33 sort of farming it out to different 72:34 workers it would cleverly when it was 72:39 about to run the map that was going to 72:41 read from input file one figure 72:44 out from GFS which server actually holds 72:47 input file one on its local disk and it 72:50 would send the map for that input file 72:53 to the MapReduce software on the same 72:55 machine so that by default this arrow 72:59 was actually a local read from the 73:01 local disk and did not involve the 73:03 network and you know depending on 73:05 failures or load or whatever it 73:07 couldn't always do that but almost all 73:10 the maps would be run on the very same 73:11 machine that stored the data thus saving 73:13 them a vast amount of time that they would 73:17 otherwise have had to wait to move the input 73:19 data across the network the next trick 73:22 they played is that map as I mentioned 73:26 before stores its output on the local 73:28 disk of the machine that you run the map 73:29 on so again storing the output of the 73:31 map does not require network 73:33 communication at least not immediately 73:35 because the output is stored on the local disk 73:38 however we know for sure that one way or 73:42 another by the way MapReduce is 73:46 defined in order to group together all 73:49 of the values associated with a given 73:51 key and pass them to a single invocation 73:55 of reduce on some machine this is going 73:57 to require network communication 73:59 you know we need to fetch 74:02 all the values and give them to a single 74:03 machine so they have to be moved across the 74:05 network and so this shuffle this 74:08 movement of the keys which are 74:11 originally stored by row on the same 74:14 machine that ran the map we need them 74:16 essentially to be stored by column on 74:18 the machine that's going to be 74:19 responsible for the reduce this 74:22 transformation of row storage to 74:23 essentially column storage is what the 74:25 paper calls a shuffle and it really 74:28 required moving every piece of data 74:30 across the network from the map that 74:33 produced it to the reduce that would 74:34 need it and it's like the expensive 74:36 part of MapReduce
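A sketch of the reduce-worker side of that shuffle, in Go: once a reduce worker has fetched its partition of the intermediate pairs from every map worker's disk, it sorts them by key and calls the application's reduce once per distinct key; the fetching itself and the final write to the file system are left out, and runReduce is just an illustrative name.

    // Group fetched intermediate pairs by key and call reduce once per key.
    package main

    import (
        "fmt"
        "sort"
        "strconv"
    )

    type KeyValue struct {
        Key   string
        Value string
    }

    func runReduce(intermediate []KeyValue,
        reducef func(string, []string) string) map[string]string {
        // sort by key so that all values for a given key are adjacent
        sort.Slice(intermediate, func(i, j int) bool {
            return intermediate[i].Key < intermediate[j].Key
        })
        out := map[string]string{}
        i := 0
        for i < len(intermediate) {
            j := i
            values := []string{}
            for j < len(intermediate) && intermediate[j].Key == intermediate[i].Key {
                values = append(values, intermediate[j].Value)
                j++
            }
            // one call to the application's reduce per distinct key;
            // in the real system its output would go to a file on GFS
            out[intermediate[i].Key] = reducef(intermediate[i].Key, values)
            i = j
        }
        return out
    }

    func main() {
        intermediate := []KeyValue{{"a", "1"}, {"b", "1"}, {"a", "1"}}
        counts := runReduce(intermediate, func(key string, values []string) string {
            return strconv.Itoa(len(values))
        })
        fmt.Println(counts) // map[a:2 b:1]
    }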
yeah 74:51 you're right you can imagine a different 74:53 definition in which you have a more kind 74:55 of streaming reduce I don't know I 74:57 haven't thought this through I don't 75:00 know whether that would be feasible 75:02 or not certainly as far as the programmer 75:04 interface goes their 75:06 number-one goal really was to 75:09 make it easy to program for people who 75:11 just had no idea of what was going on in 75:13 the system so it may be that you know 75:16 this spec this is really the way reduce 75:18 functions look you know in C++ or 75:22 something and a streaming version of 75:24 this I don't 75:28 know how it would look probably not this 75:30 simple but you know maybe it could be 75:33 done that way and indeed many modern 75:35 systems people got a lot more 75:37 sophisticated with modern things that 75:41 are the successors to MapReduce and 75:43 they do indeed involve processing 75:45 streams of data often rather than this 75:48 very batch approach this is a batch 75:50 approach in the sense that we wait until 75:52 we get all the data and then we process 75:54 it so first of all you then have to 75:57 have a notion of finite inputs right 75:59 modern systems often do indeed use 76:02 streams and are able to take 76:05 advantage of some efficiencies over 76:08 MapReduce by doing that okay so this is the point at 76:15 which this shuffle is where all the 76:17 network traffic happens and this can 76:19 actually be a vast amount of data if 76:21 you think about sorting 76:23 the output of the sort has the same 76:26 size as the input to the sort so that 76:29 means that if your 76:30 input is 10 terabytes of data and you're 76:32 running a sort you're moving 10 76:34 terabytes of data across the network at 76:36 this point and your output will also be 76:38 10 terabytes so this is quite a lot 76:40 of data and indeed it is for many 76:42 MapReduce jobs although not all there are 76:44 some that significantly reduce the 76:46 amount of data at these stages somebody 76:49 mentioned oh what if you want to feed 76:51 the output of reduce into another 76:52 MapReduce job and indeed that was often 76:55 what people wanted to do in 76:56 which case the output of the reduce might 76:58 be enormous like for sort or web 77:00 indexing the output of the reduces on ten 77:03 terabytes of input is again 77:05 gonna be ten terabytes so 77:07 the output of the reduce is also stored 77:09 on GFS the 77:12 reduce would just produce these key 77:13 value pairs but the MapReduce framework 77:18 would gather them up and write them into 77:20 giant files on GFS and so there was 77:23 another round of network communication 77:27 required to get the output of each 77:30 reduce to the GFS server that needed to 77:33 store that reduce's output and you might 77:35 think that they could have played the 77:37 same trick with the output storing 77:39 the output on the GFS server that 77:42 happened to run on the same machine as the MapReduce worker 77:46 that ran the reduce and maybe they did 77:48 do that but because GFS as well as 77:51 splitting data for performance also 77:53 keeps two or three copies for fault 77:55 tolerance that means no matter what you 77:58 need to write one copy of the data 77:59 across the network to a different server 78:01 so there's a lot of network 78:03 communication here and a bunch here also 78:05 and it was this network communication 78:08 that really limited the throughput of 78:09 MapReduce 78:10 in 2004 now in 2020 because this network 78:17 arrangement was such a limiting factor 78:19 for so many things people wanted to do 78:21 in datacenters modern data center 78:23 networks are a lot faster at the root 78:26 than this was so you know a 78:28 typical data center network you might 78:30 see today actually has many roots instead 78:32 of a single root switch that everything 78:34 has to go through you might have you 78:37 know many root switches and each rack 78:40 switch has a connection to each of these 78:42 sort of replicated root switches and the 78:44 traffic is split up among the root 78:46 switches so modern data center networks 78:48 have far more network throughput and 78:52 because of that actually I think 78:54 Google sort of stopped using MapReduce a 78:57 few years ago but before they stopped 79:00 using it the modern MapReduce actually 79:02 no longer tried to run the maps on the 79:04 same machine the data is stored on they 79:06 were happy to move the data from 79:08 anywhere because they just assumed the network 79:11 was extremely fast okay we're out of 79:16 time for MapReduce 79:18 we have a lab due at the end of next 79:21 week 79:22 in which you'll write your own somewhat 79:24 simplified MapReduce so have fun with 79:27 that 79:28 and see you on Thursday