Transcript

Hey, the TAs are going to be giving a lecture on concurrency in Go. This lecture is going to be full of design patterns and practical tips to help you with the labs. We're going to briefly cover the Go memory model and the reading we assigned, then spend most of the lecture talking about concurrency primitives in Go and Go concurrency patterns, how you do the things you'll need to do in the labs, and finally we'll talk through some debugging tips and techniques and show you some tools you might want to use when debugging the labs.

Very briefly, on the Go memory model reading: why did we assign it? The goal was to give you some concrete examples of correct ways to write threaded code in Go. The second half of the document has examples of correct and incorrect code and how things can go wrong. One thing you might have noticed is that early on it says if you need to read and understand this document, you're being too clever, and we think that's good advice. Focus on how to write correct code; don't focus too much on the happens-before relation and being able to reason about exactly why incorrect code is incorrect. We don't really care; we just want to be able to write correct code and call it a day.

One question that came up in the lecture questions was about goroutines in relation to performance. Goroutines, and concurrency in general, can be used for a couple of different reasons, and the reason we use concurrency in the labs is not performance. We're not going for parallelism, using multiple cores on a single machine to do more work on the CPU. Concurrency gets us something else besides performance through parallelism: it gets us better expressivity. We want to write down some ideas, and it happens that code using threads is a clean way of expressing those ideas. The takeaway is that when you use threads in lab 2 and beyond, don't try to do the fancy things you might do if you were going for performance, especially CPU performance. We don't care about things like fine-grained locking or similar techniques. Write code that's easy to reason about, use big locks to protect large critical sections, and just don't worry about performance in the sense of CPU performance.

With that, that's all we're going to say about the memory model; we'll spend most of this lecture just talking about Go code and Go concurrency patterns. As we go through these examples, feel free to ask questions about what's on the screen or anything else you're thinking about.

I'm going to start off talking about concurrency primitives in Go. The first thing is closures. This is something that will almost certainly be helpful in the labs, and it's related to goroutines. Here's an example program on the screen: the main function declares a couple of variables and then spawns a goroutine.
The goroutine spawned by this go statement isn't a call to some function defined elsewhere; it's this anonymous function defined inline. This is a handy pattern called a closure, and one neat thing about it is that the function defined here can refer to variables from the enclosing scope. For example, it can mutate the variable a that's defined up here, or refer to the wait group defined up here. If we go run this example, it does what you'd think: the wg.Done() here lets the main thread continue past the wg.Wait(), and it prints out the variable, which has been mutated by the concurrently running goroutine that finished before the wait completed. So this is a useful pattern to be able to use.

The reason we're pointing this out is that you might have code that looks like this in your labs. It's very similar to the previous example, except this code spawns a bunch of threads in a loop. That's useful, for example, when you want to send RPCs in parallel. In lab 2, if a candidate is asking for votes, you want to ask all the followers in parallel, not one after the other, because an RPC is a blocking operation that might take some time. Similarly, the leader might want to send AppendEntries to all the followers, and you want to do that in parallel, not in series. Threads are a clean way to express this idea, so you might have code that looks like this at a high level: in a for loop, you spawn a bunch of goroutines.

One thing to be careful about here, and this was talked about in a previous lecture, is capture of the loop variable by the goroutine while the outer scope keeps mutating it. We have this i that's being mutated by the for loop, and we want to use its value inside the goroutine. The correct way of writing this code is to pass the value of i as an argument to the function; you can rename it to x inside and then use that value. If we run this program, where I've stubbed out the "send RPC" part so it just prints the index (this i might be the index of the follower you're sending an RPC to), it prints the numbers 0 through 4 in some order. That's what we want: send RPCs to all the followers.

The reason we're showing you this code is that there's a variation that looks really similar, and intuitively you might think it does the right thing, but it doesn't. In this version the only thing that's changed is that we've gotten rid of the argument we were explicitly passing, and instead we let this i refer to the i from the outer scope. You might think that when you run this it does the same thing, but in this particular run it printed 4 5 5 5 5, so it does the wrong thing.
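Here's a minimal sketch of the two variants just described. The sendRPC helper and the use of a sync.WaitGroup are my additions to make the sketch self-contained; the slide's code may differ in the details.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		// Correct: pass the loop variable as an argument so each
		// goroutine gets its own copy of the current value of i.
		go func(x int) {
			defer wg.Done()
			sendRPC(x)
		}(i)

		// Incorrect variant (don't do this): the closure captures the
		// shared loop variable i, which the loop keeps mutating, so the
		// goroutines may all see late values such as 5.
		//
		// go func() {
		//     defer wg.Done()
		//     sendRPC(i)
		// }()
	}
	wg.Wait()
}

// sendRPC stands in for the real work, e.g. sending a RequestVote RPC
// to peer number x; here it just prints the index.
func sendRPC(x int) {
	fmt.Println(x)
}
```

As an aside, if you're on Go 1.22 or newer, the loop variable is scoped per iteration, so the capturing version happens to work too; passing the value explicitly is still the clearer habit, and it's what this lecture assumes.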
The reason the second version misbehaves is that this i is being mutated by the outer scope, and by the time the goroutine actually executes this line, the for loop has already changed the value of i, so it doesn't do the right thing. At a high level, if you're spawning goroutines in a loop, just make sure you use the pattern of passing the loop variable as an argument and everything will work. Any questions about that? It's a small gotcha, but we've seen it a whole bunch of times in office hours, so I wanted to point it out.

All right, moving on to other patterns you might want to use in your code. Oftentimes you want code that periodically does something. A very simple way to do that is to have a separate function that does something in an infinite loop, in this case just printing "tick", and then uses time.Sleep to wait for a certain amount of time. Very simple pattern; you don't need anything fancier than this to do something periodically.

One modification of this that you might want is to do something periodically until something happens. For example, you might start up a Raft peer and periodically send heartbeats, but when Kill() is called on the Raft instance, you want to actually shut down these goroutines so you don't have random goroutines still running in the background. The pattern for that looks something like this: you have a goroutine that runs in an infinite loop, does something, waits a little bit, and then you have a shared variable between it and whatever control thread decides whether this goroutine should die. In this example we have a global variable done, and what main does is wait for a while and then set done to true. In the goroutine that's ticking and doing work periodically, we just check the value of done, and if it's set, we terminate the goroutine. Since done is a shared variable being mutated and read by multiple threads, we need to guard its use with a lock; that's where the mu.Lock() and mu.Unlock() come in. For the purposes of the labs you can actually write something a little simpler than this: you have the method rf.killed() on your Raft instance, so you might have code that looks more like "while my Raft instance is not dead, periodically do some work". Any questions about that so far? Yeah, question.

Does using the locks, rather than channels, make it so that any writes to variables in those functions are guaranteed to be observed by the other thread, or would you need to send done across a channel? Okay, let me try to simplify the question a bit. I think the question is: do you need to use locks here, can you use channels instead, and can you get away with using nothing at all? What's the difference between nothing versus channels versus locks? I think the question is about this done variable: does it need to be sent across a channel, or does just using these locks ensure that the read here observes the write done by the other thread?
Okay, so the answer is yes. At a high level, if you want to ensure cross-thread communication, make sure you use Go synchronization primitives, whether that's channels or locks and condition variables. Here, because of the use of locks, after this thread writes done and calls Unlock(), the next Lock() that happens is guaranteed to observe the writes that happened before that Unlock(). So this write happens, this unlock happens, then one of these locks happens, and the next read of done is guaranteed to observe the write of true. Question?

That's a good question. In this particular code it doesn't matter, but it would be cleaner to do it. The question is why we don't call mu.Unlock() here before returning, and the answer is that at this point the program is done, so it doesn't actually end up mattering, but you're right that in general we would want to ensure that we unlock before we return. Thanks for pointing that out.

I'm not entirely sure what the question is, but maybe it's something like: can both of these acquire the lock at the same time? We'll talk a little more about locks in just a moment, but at a high level the semantics of a lock are that it's either held by somebody or not. If it's not held and someone calls Lock(), they have the chance to acquire it, and if somebody else calls Lock() before the holder calls Unlock(), that other thread is blocked until the Unlock() happens and the lock is free again. So between the Lock() and the Unlock(), for any particular lock, only a single thread can be executing what's called the critical section. Any other questions?

The question is related to timing: when you set done = true and then unlock, you have no guarantee in terms of real time about when periodic will end up being scheduled, observe that write, and actually terminate. Yes, if you wanted to ensure that periodic has actually exited for some particular reason, you could write code that communicates back from periodic acknowledging this. But in this particular case, the only reason we have the sleep here is to demonstrate that tick prints for a while and that periodic is indeed cancelled, because it stops printing before I get my shell prompt back. In general, for a lot of these background threads, you can just tell them to die, and it doesn't matter whether they're killed within one second or two seconds or exactly when Go schedules them, because the thread will just observe the write to done, exit, and do no more work. Another thing about Go is that if you spawn a bunch of goroutines, one of them is the main goroutine, this one here, and the way Go works is that if the main goroutine exits, the whole program terminates and all goroutines are terminated.
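For reference, here's a minimal sketch of the periodic-with-shutdown example being discussed; the variable and function names are my guesses at what's on the slide, and the sleep durations are arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu   sync.Mutex
	done bool
)

func main() {
	fmt.Println("started")
	go periodic()
	time.Sleep(5 * time.Second) // let it tick for a while
	mu.Lock()
	done = true
	mu.Unlock()
	fmt.Println("cancelled")
	time.Sleep(3 * time.Second) // observe that the ticking has stopped
}

func periodic() {
	for {
		fmt.Println("tick")
		time.Sleep(1 * time.Second)
		mu.Lock()
		if done {
			mu.Unlock()
			return
		}
		mu.Unlock()
	}
}
```

In the labs, the simpler form mentioned above usually replaces the hand-rolled done flag: the loop condition becomes something like "for !rf.killed()".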
That's a great question. I think the question is: why do you need locks at all here? Can you just delete them? Looking at this code, main does a write of true at some point and periodic is repeatedly reading it, so at some point it should observe that write, right? Well, it turns out this is exactly why Go has this fancy memory model and the whole happens-before relation. The compiler is allowed to take this code and emit low-level machine code that does something a little different from what you intuitively thought would happen. We can talk about that in detail offline, after the lecture or in office hours, but at a high level, one rule you can follow is: if you have accesses to shared variables and you want them to be observed across different threads, you need to be holding a lock when you read or write those shared variables. In this particular case, the Go compiler would be allowed to lift the read of done outside the for loop, so it reads the shared variable once, and if done is false, it turns the inside into an infinite loop. The way this thread is written, it uses no synchronization primitives, no mutex lock or unlock, no channel sends or receives, so it's not guaranteed to observe any mutations done by other concurrently running threads. If you look on Piazza, I've actually posted a particular Go program that gets optimized in this unintuitive way: it produces code that loops forever even though, looking at it, you might think the obvious way to compile it would produce something that terminates. So the memory model is pretty fancy, and it's really hard to think about why exactly incorrect programs are incorrect, but if you follow some general rules, like holding locks before you mutate shared variables, you can avoid thinking about these nasty issues. Any other questions?

All right, let's talk a little more about mutexes. Why do you need a mutex? At a high level, whenever you have concurrent access by different threads to some shared data, you want to ensure that reads and writes of that data are atomic. Here's an example program that declares a counter and then spawns a thousand goroutines that each read the counter value and increment it by one. Looking at this, you might think that when I print the value of the counter at the end, it should print a thousand, but it turns out we missed some of the updates; in this particular run it only printed 947. What's going on is that the update isn't protected in any way, so threads running concurrently can read the value of counter, update it, and clobber other threads' updates. Basically, we want this entire read-modify-write to happen atomically.
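Here's a sketch of the racy counter just described, with the mutex-based fix that the next paragraph walks through shown alongside; the one-second sleep is the same simplification the lecture uses in place of a wait group.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Racy version: 1000 goroutines all do an unsynchronized
	// read-modify-write of counter, so updates can be lost and the
	// final value is often less than 1000 (e.g. 947).
	counter := 0
	for i := 0; i < 1000; i++ {
		go func() {
			counter = counter + 1
		}()
	}
	time.Sleep(1 * time.Second)
	fmt.Println("racy counter:", counter)

	// Fixed version: the same loop, but the increment is a critical
	// section protected by a mutex, so no updates are lost.
	var mu sync.Mutex
	counter2 := 0
	for i := 0; i < 1000; i++ {
		go func() {
			mu.Lock()
			defer mu.Unlock()
			counter2 = counter2 + 1
		}()
	}
	time.Sleep(1 * time.Second)
	fmt.Println("locked counter:", counter2)
}
```

Running this under the race detector (go run -race) will flag the first loop, which is exactly the point of the demo.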
The way you make a block of code run atomically is by using locks. In this code example we've fixed the bug: we create a lock, and all the goroutines that modify the counter first grab the lock, then update the counter value, then unlock. You'll see we're using the defer keyword here. What it does is basically the same as putting the unlock down at the bottom: we grab the lock, do the update, then unlock. defer is just a nice way of remembering to do this, because you might forget to write the unlock later; you can think of it as scheduling the unlock to run at the end of the current function body. This is a really common pattern you'll see, for example, in your RPC handlers for the lab. RPC handlers often read or write data on the Raft structure, and those updates should be synchronized with other concurrently happening updates, so the pattern for an RPC handler is often: grab the lock, defer the unlock, and then do the work inside. If we run this code, it produces the expected result: it prints a thousand and we haven't lost any updates. So at a high level, what a lock or mutex does is guarantee mutual exclusion for a region of code, which we call a critical section. In here, this is the critical section, and the lock ensures that none of these critical sections execute concurrently with one another; they're all serialized, happening one after another. Question?

Yes, that's a good observation: this particular code is actually not guaranteed to print a thousand, depending on how thread scheduling ends up happening, because all the main goroutine does is wait for one second, which is an arbitrary amount of time, and then print the value of the counter. I just wanted to keep this example as simple as possible. A different way to write this code that would be guaranteed to print a thousand would be to have the main goroutine wait for all thousand goroutines to finish; you could do that using a wait group, for example, but we didn't want to put two synchronization primitives, wait groups and mutexes, in the same example. That's why this code is technically incorrect, but I think it still demonstrates the point of locks. Any other questions?

Great. So at a very high level you can think of locks as: you grab the lock, you mutate the shared data, and then you unlock. Does this pattern always work? It turns out that's a useful starting point for how to think about locks, but it's not really the complete story. Here's some code that doesn't fit on the screen, but I'll explain it as we scroll through it. It basically implements a bank. I have Alice and Bob, who both start out with some balance, and I keep track of the total amount of money stored in my bank. Then I spawn two goroutines that transfer money back and forth between Alice and Bob. This one goroutine, a thousand times, will deduct one from Alice and add it to Bob.
Concurrently, I have this other goroutine that in a loop deducts one from Bob and adds it to Alice. Notice that I have this mutex here, and whenever I manipulate these variables that are shared between the two threads, I'm always locking the mutex; the update only happens while the lock is held. So is this code correct or incorrect? There actually isn't a straightforward answer to that question; it depends on the semantics of my bank, on what behavior I expect. So I'm going to introduce another thread, the audit thread. Every once in a while it checks the sum of all the accounts in my bank and makes sure that sum is the same as what it started out as, because if I only allow transfers within my bank, the total amount should never change. What this thread does is grab the lock, sum up Alice plus Bob, compare that to the total, and if it doesn't match, print that it observed a violation: the total is no longer what it should be. If I run this code, I see that a whole bunch of times this concurrently running thread does indeed observe that Alice plus Bob is not equal to the expected total. So what went wrong? We were following our basic rule of grabbing a lock whenever we access data shared between threads, and it is indeed true that no updates to these shared variables happen while the lock is not held. Exactly; let me repeat that for everybody to hear. What we intended was for this decrement and increment to happen atomically, but what we ended up writing was code that decrements atomically and then increments atomically. In this particular code we actually won't lose money in the long term: if we let these threads run, wait for them to finish, and then check the total, it will indeed be what it started out as. But while they're running, since this entire block of code is not atomic, we can temporarily observe these violations. So at a higher level, the way to think about locking is not just that locks protect access to shared data; locks are meant to protect invariants. You have some shared data that multiple threads might access, and there are properties that hold on that shared data. For example, here I as the programmer decided that I want the property that Alice plus Bob equals some constant, and that should always be the case. But different threads running concurrently make changes to this data and might temporarily break the invariant: when I decrement Alice, the sum Alice plus Bob temporarily changes, but then the thread eventually restores the invariant. So locks are meant to protect invariants. At a high level, you grab a lock, you do some work that might temporarily break the invariant, but you restore the invariant before you release the lock.
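Here's a sketch of the corrected transfer described next, where the lock is held across both halves of each transfer so the audit goroutine can never observe the invariant (alice + bob == total) broken; the variable names and amounts are my guesses at the slide's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	alice, bob := 5000, 5000
	total := alice + bob
	var mu sync.Mutex

	// Transfer from Alice to Bob, holding the lock across BOTH updates,
	// so the invariant is restored before anyone else can look at the
	// balances. (The buggy version locked and unlocked around each
	// update separately.)
	go func() {
		for i := 0; i < 1000; i++ {
			mu.Lock()
			alice -= 1
			bob += 1
			mu.Unlock()
		}
	}()
	go func() {
		for i := 0; i < 1000; i++ {
			mu.Lock()
			bob -= 1
			alice += 1
			mu.Unlock()
		}
	}()

	// Audit thread: checks the invariant, also under the lock.
	go func() {
		for {
			mu.Lock()
			if alice+bob != total {
				fmt.Printf("observed violation: alice=%d bob=%d sum=%d\n",
					alice, bob, alice+bob)
			}
			mu.Unlock()
		}
	}()

	time.Sleep(1 * time.Second)
}
```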
That way nobody can observe these in-progress updates. So the correct way to write this code is to actually use fewer calls to lock and unlock: we lock, then do all the work, and then unlock. When you run this code, we see no more printouts like before; the audit thread never observes that the total is not what it should be. So that's the right way to think about locking. At a high level you can think of it as: make sure you grab locks whenever you access shared data, that's one rule, but another important rule is that locks protect invariants. You grab a lock, manipulate things in a way that might temporarily break the invariants, restore them, and then release the lock. Another way to think about it is that locks make entire regions of code atomic, not just single statements or single updates to shared variables. Any questions about that?

Great. The next synchronization primitive we're going to talk about is condition variables. It seems like there's been a source of confusion since lab 1, where we mentioned condition variables but didn't quite explain them, so we're going to take the time to explain them now, in the context of an example you should all be familiar with: counting votes. Remember, in lab 2A you have this pattern where whenever a Raft peer becomes a candidate, it wants to send out vote requests to all of its followers, and eventually the followers come back to the candidate and say yes or no, whether or not the candidate got the vote. One way we could write this code is to have the candidate ask peer number one, then peer number two, then peer number three, in series, but that's bad, because we want the candidate to ask all the peers in parallel so it can quickly win the election when possible. And there are other complexities: when we ask all the peers in parallel, we don't want to wait for a response from all of them before making up our mind, because if a candidate gets a majority of votes, it doesn't need to wait until it hears back from everybody else. So this code is kind of complicated in some ways. Here's a stubbed-out version of what that vote counting code might look like, with a little bit of infrastructure to make it actually run. The main goroutine sets count, the number of yes votes I've gotten, to zero, and finished, the total number of responses I've gotten, to zero. The idea is that I want to send out vote requests in parallel and keep track of how many yeses and how many responses I've gotten, and once I know whether I've won the election or whether I've lost it, I can act on that and move on; in the real Raft code you'd do whatever you need to do, step up to leader or step down to follower, once you have the result. So looking at this code, say I have ten peers: in parallel I spawn ten goroutines,
passing in this closure. Each goroutine requests a vote, and if it gets the vote it increments count by one, and either way it increments finished by one; count is the number of yeses, finished is the total number of responses. Outside, in the main goroutine, I'm waiting for a condition to become true: either I have enough yes votes to have won the election, or I've heard back from enough peers to know that I've lost. So in a loop I check and wait until count is greater than or equal to five or finished is equal to ten, and once that's the case, I can determine whether I've won or lost. Does anybody see any problems with this code, given what we just talked about with mutexes? Yes, exactly: count and finished aren't protected by a mutex. So one thing we certainly need to fix is that whenever we have shared variables, we need to protect access to them with a mutex. That's not too bad to fix here: I declare a mutex that's accessible by everybody, and then in the goroutines I'm launching in parallel to request votes, and this pattern here is pretty important, I first request the vote while not holding the lock, and only afterwards grab the lock and update these shared variables. Outside, I have the same pattern as before, except I make sure to lock and unlock around reading these shared variables: in an infinite loop I grab the lock, check whether the result of the election has been determined by this point, and if not I keep going around the loop; otherwise I unlock and then do what I need to do outside. If I run this example, it seems to work, and this is actually a correct implementation, it does the right thing, but there are some problems with it. Can anybody spot them? I'll give you a hint: this code is not as nice as it could be. Not quite; it will wait for exactly the right amount of time. The issue is that it's busy waiting. In a very tight loop it grabs the lock, checks the condition, unlocks, grabs the lock, checks the condition, unlocks, and it's going to burn 100% of the CPU on one core while doing this. So this code is correct, but, even though at a high level we don't care about CPU efficiency for the purposes of the labs, if you're using a hundred percent of one core, you might actually slow down the rest of your program enough that it won't make progress. That's why this pattern is bad: we're burning up a hundred percent of a core waiting for some condition to become true. Does anybody have ideas for how we could fix this? Here's one simple solution where I change a single line of code: all I've added here is a sleep of 50 milliseconds, and this is a correct transformation of the program.
And it kind of seems to solve the problem: before, I was burning a hundred percent of a core, and now, only once every 50 milliseconds, I briefly wake up, check the condition, and go back to sleep if it doesn't hold. So this is basically a working solution. Any questions? This kind of sort of works, but one thing you should always be aware of whenever you write code is magic constants. Why 50 milliseconds? Why not a different number? Whenever you have an arbitrary number in your code, it's a sign that you're doing something that's not quite right, or not as clean as it could be. It turns out there's a concurrency primitive designed to solve exactly this problem: I have some threads running concurrently that are making updates to some shared data, and I have another thread that's waiting for some condition, some property of that shared data, to become true, and until it becomes true the thread just waits. The tool designed exactly for this is called a condition variable.

The way you use a condition variable, the pattern basically looks like this. We have our lock from earlier; condition variables are associated with locks, so we have some shared data, a lock that protects that shared data, and then a condition variable that's given a pointer to the lock when it's initialized. We're going to use the condition variable to coordinate around a certain condition, some property of that shared data, becoming true. We modify our code in two places: the place where we make changes to the data, which might make the condition become true, and the place where we're waiting for the condition to become true. The general pattern is: whenever we do something that changes the data, we call cond.Broadcast(), and we do this while holding the lock; and on the other side, where we're waiting for the condition on that shared data to become true, we call cond.Wait(). Let's think about what happens in the main thread for a moment. The main thread grabs the lock and checks the condition; suppose it's false, so it calls cond.Wait(). You can think of what this does as atomically releasing the lock, in order to let other threads make progress, and adding itself to a list of threads waiting on this condition variable. Then, concurrently, one of these vote-requesting threads might acquire the lock after it's gotten a vote, manipulate the shared variables, and call cond.Broadcast(), which wakes up whoever is waiting on the condition variable. Once that thread unlocks the mutex, the waiting thread, as it returns from Wait(), reacquires the mutex and goes back to the top of its for loop, which rechecks the condition. So the Broadcast wakes up whoever is parked at the Wait, and this avoids having to call time.Sleep for some arbitrary amount of time.
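Here's a sketch of the vote-counting example rewritten with a condition variable, matching the pattern just described; requestVote is a stub standing in for the real RPC, and the exact thresholds (5 of 10) follow the lecture's setup.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	count := 0    // yes votes
	finished := 0 // total responses
	var mu sync.Mutex
	cond := sync.NewCond(&mu)

	for i := 0; i < 10; i++ {
		go func() {
			vote := requestVote() // do the slow RPC without the lock
			mu.Lock()
			defer mu.Unlock()
			if vote {
				count++
			}
			finished++
			// We changed the data the waiter cares about, so wake it up.
			cond.Broadcast()
		}()
	}

	mu.Lock()
	// Instead of busy-polling or sleeping for a magic 50ms, wait until
	// the election outcome is known.
	for count < 5 && finished != 10 {
		cond.Wait() // atomically releases mu; reacquires it on wakeup
	}
	if count >= 5 {
		fmt.Println("received a majority of votes")
	} else {
		fmt.Println("lost the election")
	}
	mu.Unlock()
}

func requestVote() bool {
	time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	return rand.Int()%2 == 0
}
```

Note that the Broadcast happens while the lock is held, and the condition is always rechecked in a loop around Wait; that's the discipline described next.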
The thread that's waiting for some condition to become true only gets woken up when something changes that might make that condition become true. If you think about these vote-requesting threads, if they're very slow and don't call cond.Broadcast() for a long time, this waiter will just wait; it won't be periodically waking up to check a condition that can't have changed, because nobody else has touched the shared data. Any questions about this pattern? Yeah.

That's a great question; I think you're referring to something called the lost wakeup problem. This is a topic in operating systems and we won't talk about it in detail now, but feel free to ask me after lecture. At a high level, you can avoid the funny race conditions that might happen between Wait and Broadcast by following the particular pattern I'm showing here, and I'll show you an abstracted version of it in a moment. Basically, on the side that makes changes that can affect the outcome of the condition test, you always lock, then manipulate the data, then call Broadcast, and call Unlock afterwards; the Broadcast must be called while holding the lock. Similarly, when you're checking the condition, you grab the lock and always check the condition in a loop, and inside, when the condition is false, you call cond.Wait(). Wait is only called while you're holding the lock, and it atomically releases the lock and puts the thread on a list of waiters; then, as Wait returns and you go back to the top of the loop, it reacquires the lock, so the check only ever happens while holding the lock. Once you're past the loop you still hold the lock, and you unlock after you're done doing whatever you need to do.

At a high level the pattern looks like this: we have one thread, or some number of threads, doing something that might affect the condition, and they grab the lock, do the thing, call Broadcast, then unlock. On the other side we have some thread waiting for the condition to become true, and the pattern there is: grab the lock, then in a loop, while the condition is false, wait. When we get past that loop, we know the condition is true and we're holding the lock, so we can do whatever we need to do, and finally we unlock. We can talk after lecture about all the things that might go wrong if you violate one of these rules, if you're interested, but at a high level, if you follow this pattern, you won't need to deal with those issues. Any questions about that? Yeah.

That's a great question: when do you use Broadcast versus Signal? Condition variables have three methods on them: one is Wait, for the waiting side, and on the other side you can use Signal or Broadcast. The semantics are that Signal wakes up exactly one thread that may be waiting, whereas Broadcast wakes up everybody who is waiting.
They'll all wake up, try to grab the lock, and recheck the condition, and only one of them will proceed at a time, because only one of them can hold the lock until it gets past that point. For the purposes of this class, always use Broadcast and never use Signal. If you follow this pattern and just always use Broadcast, your code will work. You can think of Signal as something used for efficiency, and we don't really care about that level of CPU efficiency in the labs for this class. Any more questions?

Okay, the final topic we're going to cover in terms of Go concurrency primitives is channels. At a high level, channels are a queue-like synchronization primitive, but they don't behave quite like queues in the intuitive sense. I think some people think of channels as a data structure you can stick things into, and eventually someone will pull those things out, but in fact channels have no queuing capacity, no internal storage. Channels are basically synchronous. If you have two goroutines that are going to send and receive on a channel, and someone tries to send on the channel while nobody is receiving, that thread will block until somebody is ready to receive, and at that point the data is handed over to the receiver synchronously. The same is true in the other direction: if someone tries to receive from a channel while nobody is sending, that receive will block until another goroutine is about to send on the channel, and the send happens synchronously. Here's a little demo program that demonstrates this. I declare a channel and spawn a goroutine that waits for a second and then receives from the channel. In my main goroutine I record the time, then I send on the channel, just putting some dummy data into it, and then I print out how long the send took. If you think of channels as queues with internal storage capacity, you might expect this to complete very fast, but that's not how channels work: the send is going to block until the receive happens, and that doesn't happen until the one second has elapsed, so we're actually blocked in the main goroutine for one whole second. So don't think of channels as queues; think of them as a synchronous communication mechanism.

Another example that makes this really obvious: here we have a goroutine that creates a channel, sends on the channel, and then tries receiving from it. Does anybody know what will happen when I try running this? I think the file name might give it away. Yeah, exactly: the send is going to block until somebody is ready to receive, but there is no receiver. Go actually detects this condition: if all your goroutines are blocked, it detects that as a deadlock and crashes. But you can have more subtle bugs where some other thread is off doing something.
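Here's a sketch of the timing demo just described: the send blocks for about a second because the receiver only shows up after the sleep.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	c := make(chan bool) // unbuffered: no internal storage
	go func() {
		time.Sleep(1 * time.Second)
		<-c // the receiver only arrives after one second
	}()
	start := time.Now()
	c <- true // blocks until the receiver above is ready
	fmt.Printf("send took %v\n", time.Since(start)) // roughly one second
}
```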
For example, if I spawn another goroutine that does nothing but run an empty for loop, and I try running this program again, now Go's deadlock detector won't notice that no useful work is being done; there is one thread running, it's just that this send never has a receiver. We can tell by looking at this program that it will never terminate, but when you run it, it just looks like it hangs. So if you're not careful with channels, you can get these subtle bugs where you end up with a deadlock. Yeah, exactly: there's no data, nobody is sending on this channel, so it's going to block here and never get to this line. And right, as you pointed out, channels can't really be used within just a single goroutine; it doesn't really make sense, because in order to send, or in order to receive, there has to be another goroutine doing the opposite action at the same time, and if there isn't, you're just going to block forever and that thread will no longer do any useful work. Sends wait for receives, receives wait for sends, and the exchange happens synchronously once both a sender and a receiver are present.

What I've talked about so far is unbuffered channels. I was going to avoid talking about buffered channels, because there are very few problems they're actually useful for solving. Buffered channels take a capacity; here's a buffered channel with a capacity of one, and this program does terminate, because buffered channels have some internal storage space, and until that space fills up, sends are non-blocking: they just put the data in the internal storage. But once the channel does fill up, it behaves like an unbuffered channel, in the sense that further sends will block until there's a receive to make space in the channel. At a high level, I think we should avoid buffered channels because they basically don't solve any problems, and another thing to keep in mind is that whenever you have to make up arbitrary numbers, like this capacity here, to make your code work, you're probably doing something wrong. Yeah, I think this is a question about terminology: what exactly does deadlock mean, and does this count as a deadlock? Yes, this counts as a deadlock: no useful progress will be made here, these threads are just stuck forever. Any other questions?

So what are channels useful for? I think channels are useful for a small set of things, for example producer-consumer queue sorts of situations. Here I have a program that makes a channel and spawns a bunch of goroutines that are going to be doing some work, say computing some result and producing some data. I have a bunch of these goroutines running in parallel, and I want to collect all that data as it comes in and do something with it. The doWork function just waits for a bit and produces a random number, and in the main goroutine I continuously receive on the channel and print the values out. This is a great use of channels.
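Here's a sketch of that producer-consumer use; doWork is a stand-in that sleeps and returns a random number, and the worker count is arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	c := make(chan int)
	// Producers: several goroutines compute results and send them.
	for i := 0; i < 4; i++ {
		go doWork(c)
	}
	// Consumer: the main goroutine collects results as they arrive.
	for {
		v := <-c
		fmt.Println(v)
	}
}

func doWork(c chan int) {
	for {
		time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
		c <- rand.Int()
	}
}
```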
Another good use of channels is to achieve something similar to what wait groups do. Suppose that, rather than using a wait group, I want to spawn a bunch of threads and wait until they're all done doing something. One way to do that is to create a channel, then spawn a known number of threads, five goroutines here, each of which does something and then sends on the channel when it's done, and then in the main goroutine I just receive from that channel the same number of times. This has the same effect as a wait group. So the question is: could you use a buffered channel with a capacity of five here, since you're waiting for five receives? I think in this particular case, yes, that would have an equivalent effect, but there's not really a reason to do it, and at a high level, in your code you should avoid buffered channels, and maybe even channels in general, unless you think very hard about what you're doing.

So what is a wait group? I think we covered this in a previous lecture, and I talked about it very briefly today, but I do have an example. A wait group is yet another synchronization primitive provided by Go in the sync package, and it does what its name advertises: it lets you wait for a certain number of threads to be done. The way it works is that you call wg.Add, which increments an internal counter, and then when you call wg.Wait, it waits until Done has been called as many times as Add was called. This code is basically the same as the code I just showed you using a channel; they have the exact same effect and you can use either one. The question here is about race conditions, something like: what happens if the Add doesn't happen before the Wait? Notice that the pattern here is that we call wg.Add outside the goroutine, before spawning it, so the Add happens first and the goroutine is spawned next; we'll never have the situation where the Done or the Wait runs before the Add has happened for that particular goroutine. How is this implemented by the compiler and runtime? I won't talk about that now; ask me after class or in office hours, but for the purposes of this class you need to know the API for these things, not the implementation. And that's basically all I have on Go concurrency primitives.
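Here's a sketch of both variants just described, the channel-counting version and the sync.WaitGroup version; the worker bodies are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Channel version: each worker sends once when it's finished, and
	// main receives exactly as many times as it spawned workers.
	done := make(chan bool)
	for i := 0; i < 5; i++ {
		go func(x int) {
			fmt.Println("worker", x, "done")
			done <- true
		}(i)
	}
	for i := 0; i < 5; i++ {
		<-done
	}

	// WaitGroup version: Add before spawning, Done inside the worker,
	// Wait in main. Same effect as the channel version above.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(x int) {
			defer wg.Done()
			fmt.Println("worker", x, "done (wg)")
		}(i)
	}
	wg.Wait()
}
```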
One final thought on channels: channels are good for a specific set of things, like the producer-consumer queue I just showed you, or implementing something like wait groups, but when you try to do fancier things with them, for example kicking another goroutine that may or may not be waiting to be woken up, that's a tricky thing to do with channels, and there are a bunch of other ways to shoot yourself in the foot with them. I'm going to avoid showing you examples of bad code with channels, just because it's not that useful to see, but I personally avoid using channels for the most part and just use shared memory, mutexes, and condition variables instead, and I personally find those much easier to reason about. So feel free to use channels when they make sense, but if anything looks especially awkward to do with channels, just use mutexes and condition variables; they're probably a better tool. Yeah, so the question is: what's the difference between this producer-consumer pattern and a thread-safe FIFO? I think they're kind of equivalent; you could do this with a thread-safe FIFO, and that is roughly what a buffered channel is. If you're enqueueing and dequeueing things, and you want the send to finish so this thread can go do something else while the data sits in a queue, rather than the goroutine waiting to hand it off, then a buffered channel might make sense, but at least in the labs you won't have a pattern like that. All right, next Fabian is going to talk about more Raft-related stuff.

All right, can you all hear me? Is this working? Yeah. All right, so basically I'm going to show you two bugs that we commonly see in people's Raft implementations. There are a lot of bugs that are pretty common, but I'm just going to focus on two of them. In this first example we have the start of a Raft implementation, sort of the beginnings of what you might write for 2A. In our Raft state we have, primarily, the current status of the Raft peer, either follower, candidate, or leader, and we have these two state variables keeping track of the current term and who we voted for in the current term. I want us to focus, though, on these two functions, attemptElection and callRequestVote. In attemptElection we're just going to set our state to candidate, increment our current term, vote for ourselves, and then start sending out RequestVotes to all of our Raft peers. This is similar to some of the patterns that Anish showed: we loop through our peers, and for each one, in a separate goroutine, we call this callRequestVote function to actually send an RPC to that peer. In callRequestVote we acquire the lock, prepare the arguments for our RequestVote RPC call by setting them based on the current term, then actually perform the RPC call over here, and finally, based on the response, we report back to the attemptElection function, which eventually should tally up the votes to see if it got a majority and can become leader. So what happens when we run this code? In theory, what we might expect is that there's some code that spawns a few Raft peers and tries to attempt elections on them, and what should happen is that we just start collecting votes from the other peers; we're not actually going to tally them up, but hopefully nothing
weird goes wrong. But actually something is going to go wrong here: we activated Go's deadlock detector, and somehow we ran into a deadlock. Let's see what happened. For now, let's focus on what's going on with server 0. It says it starts attempting an election at term 1; that's just the start of the attemptElection function. It acquires the lock, sets some things up for performing the election, and then unlocks. Then it sends a RequestVote RPC to server 2 and finishes processing that RPC (we're just printing right before and after we actually send the RPC), and then it sends a RequestVote RPC to server 1, but after that we never see it finish sending that RequestVote RPC. It's actually stuck in that function call, waiting for the RPC response from server 1. Now let's look at what server 1 is doing; it's pretty much the same thing. It sends a RequestVote RPC to server 2, that succeeds, it finishes processing the response from server 2, and then it sends its RPC to server 0. So what's actually happening is that 0 and 1 are waiting for the RPC responses from each other: they've both sent out an RPC call but haven't gotten the response yet, and that's the cause of our deadlock.

The real reason we're deadlocking is that we're holding the lock across our RPC calls. Over here in callRequestVote, we acquire the mutex associated with our Raft peer, and we only unlock at the end of the function, so throughout this entire function we're holding the lock, including while we try to contact our peer to get the vote. Later, when we handle an incoming RequestVote RPC, at the beginning of the handler we also try to acquire the lock, but we never succeed in acquiring it. To make this a little clearer, the order of operations is: in callRequestVote, server 0 first acquires its lock and sends an RPC call to server 1, and simultaneously and separately, server 1 does the same thing: it enters its callRequestVote function, acquires its lock, and sends an RPC call to server 0. Now server 0's handler and server 1's handler are each trying to acquire their peer's lock, but they can't, because each peer already holds its own lock while trying to send the RPC call to the other, and that's what leads to the deadlock. To solve this, basically, we want you to not hold locks through RPC calls. In fact, we don't need the lock here at all: instead of reading the current term when we enter callRequestVote, we can pass it as an argument, saving the term while we held the lock earlier in attemptElection and passing it in as a variable. That actually removes the need to acquire the lock at all in callRequestVote.
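Here's a sketch of that fix: the term is read once, under the lock, in attemptElection and passed into callRequestVote, so nothing holds the lock across the RPC. The struct fields, the RequestVote argument and reply types, and the sendRequestVote helper are stand-ins loosely modeled on the lab skeleton, not the exact code on the slides.

```go
package raft

import "sync"

// Minimal stand-ins so this sketch compiles on its own; the real lab
// code has richer versions of all of these.
type RequestVoteArgs struct {
	Term        int
	CandidateId int
}
type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}
type Raft struct {
	mu          sync.Mutex
	me          int
	peers       []int // stand-in for the lab's list of peer endpoints
	state       string
	currentTerm int
	votedFor    int
}

// Stand-in for the lab's RPC helper, which calls the peer over the network.
func (rf *Raft) sendRequestVote(server int, args *RequestVoteArgs, reply *RequestVoteReply) bool {
	return false
}

func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	rf.state = "candidate"
	rf.currentTerm++
	rf.votedFor = rf.me
	term := rf.currentTerm // snapshot the term while holding the lock
	rf.mu.Unlock()

	for server := range rf.peers {
		if server == rf.me {
			continue
		}
		go func(server int) {
			rf.callRequestVote(server, term) // vote tallying comes later
		}(server)
	}
}

func (rf *Raft) callRequestVote(server int, term int) bool {
	// No rf.mu.Lock() around the RPC: holding the lock across this call
	// is what deadlocked servers 0 and 1 against each other.
	args := RequestVoteArgs{Term: term, CandidateId: rf.me}
	var reply RequestVoteReply
	if ok := rf.sendRequestVote(server, &args, &reply); !ok {
		return false
	}
	// If we need to read or modify rf's state based on the reply, we can
	// lock here, after the RPC has returned.
	return reply.VoteGranted
}
```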
Alternatively, we could lock while we're preparing the arguments, then unlock before actually performing the call, and then, if we need to process the reply, we could lock again afterwards. So just make sure to unlock before making the actual RPC call, and then, if you need to, you can acquire the lock again. Now if I save this and run it again, it still triggers the deadlock detector, but that's actually just because we're not doing anything at the end; now it's actually working: we finish sending the RequestVotes on both sides, and all the operations we wanted to complete are complete. Any questions about this example? Yeah, so you might need to use locks when you're preparing the arguments or processing the response, but you shouldn't hold the lock through the RPC call while you're waiting for the other peer to respond. There's actually another reason for that, in addition to deadlock: in some tests we're going to have this unreliable network that can delay some of your RPC messages, potentially by something like 50 milliseconds, and in that case, if you hold the lock through an RPC call, any other operation you try to do during those 50 milliseconds won't be able to complete until that RPC response is received. So that's another issue you might run into if you hold the lock: it's both to make things more efficient and to avoid these potential deadlock situations.

All right, just one more example, again using a similar Raft implementation. Again, in our Raft state we keep track of whether we're a follower, candidate, or leader, plus those same two state variables. In this example I want you to focus on the attemptElection function. We've first implemented the change I just showed you, storing the term here and passing it as a variable to the function that sends the RequestVotes, but additionally we've implemented some functionality to add up the votes. We create a local variable to count the votes, and whenever we get a vote, if the vote was not granted, we return immediately from the goroutine that's processing the vote; otherwise we acquire the lock before updating this shared local variable that counts the votes, and then, if we don't yet have a majority of the votes, we return immediately, and otherwise we make ourselves the leader. As with the other example, if I look at this it initially seems reasonable, but let's see if anything can go wrong. Here's the log output from one run, and one thing you might notice is that we've actually elected two leaders on the same term: server 0 made itself a leader on term 2, and server 1 did as well. It's okay to have leaders elected on different terms, but having two on the same term should never happen. So how did this actually come up? Let's start from the top. At the beginning, server 0 actually attempted an election
53:37 all right so just one more example this 53:41 is again using a similar Raft 53:45 implementation so again in our Raft 53:47 state we're keeping track of 53:48 whether we're a follower, candidate, or leader and 53:49 then also these two state variables in 53:52 this example I want you to focus on this 53:54 attempt election function so now we've 53:57 first implemented the change that I just 53:59 showed you to store the term here and 54:02 pass it as a variable to our function 54:04 that collects the request votes but 54:06 additionally we've implemented some 54:07 functionality to add up the votes so 54:10 what we'll do is we'll create a local 54:12 variable to count the votes and whenever 54:16 we get a vote if the vote was not 54:18 granted 54:19 we'll return immediately from the goroutine 54:20 where we're processing the vote 54:22 otherwise we'll acquire the lock before 54:25 updating this shared local variable to 54:28 count up the votes and then if we did 54:31 not get a majority of the votes we'll 54:32 return immediately otherwise we'll make 54:34 ourselves the leader so as with the 54:38 other example if you 54:42 look at this initially 54:43 it seems reasonable but let's see if 54:45 anything can go wrong all right so this 54:50 is the log output from one run and one 54:53 thing you might notice is that we've 54:54 actually elected two leaders on the same 54:57 term so server zero 54:59 made itself a leader on 55:03 term two and server one did as well it's 55:06 okay to have leaders elected on 55:08 different terms but here we have 55:09 two on the same term and that should 55:11 never happen alright so how did this 55:13 actually come up let's start from the 55:15 top so at the beginning server zero 55:18 actually attempted an election at term 55:20 one not term two and it got its votes 55:23 from both of the other peers but for 55:27 whatever reason perhaps because the 55:28 reply messages from those peers were 55:30 delayed it didn't actually 55:34 process those votes until later and in 55:38 between 55:40 attempting the election and finishing 55:42 the election server one also decided to 55:45 attempt an election perhaps because 55:47 server zero was delayed so 55:49 much server one might 55:50 actually have run into its election timeout 55:52 and then started its own election and it 55:54 started it on term two because it couldn't 55:57 have been term one since it had already 55:59 voted for server zero on term one over 56:01 here 56:03 okay so then server one sends out its own 56:08 request votes to servers two and zero at term 56:11 two and now we see that server two votes 56:14 for server one that's fine but server zero 56:16 also votes for server one this is actually 56:18 also fine because server one is asking 56:21 server zero for a vote on a higher term and 56:25 so what server zero should do if you 56:28 remember from the spec is set its 56:33 current term to the term in the request 56:35 vote RPC message which is term two and also 56:37 revert itself to a follower instead of a 56:39 candidate alright finally the real 56:43 problem is on this line where 56:44 server zero although it really got enough 56:47 votes on term one made itself a leader 56:49 on term two so one 56:53 explanation for why this is happening is 56:55 that in between where we set up the 56:57 election in attempt election 57:00 and where we actually process the votes 57:02 some other things are happening and in 57:05 this case we're actually voting for 57:07 someone else in between so we're no 57:11 longer on term one where we thought we 57:12 started the election we're now on term two 57:14 and so we need to double-check 57:17 because we don't have the lock while 57:19 we're performing the RPC calls which is 57:21 important for its own reasons 57:23 some things might have changed and we need to 57:24 double-check that what we assumed was true 57:26 when we set ourselves to 57:28 leader is still true 57:32 there are a few different ways 57:34 to solve this you could 57:35 imagine not voting for others while 57:36 we're in the middle of attempting an 57:38 election but in this case the simplest 57:39 way to solve it at least in this 57:42 implementation is to just double-check 57:45 that we're still on the same term and 57:46 we're still a candidate and haven't 57:48 reverted to a follower so one 57:50 thing I want to show you is if we do 57:52 print out our state over here then we do 57:57 see that server zero became a follower but 58:00 it's still setting itself to leader on 58:02 this line 58:04 so yeah we can just check for that if 58:07 we're not a candidate or the current 58:10 term doesn't match the term at which we 58:12 started the election then let's just 58:14 return and if we do that then 58:18 only server one becomes a leader and we 58:20 never see server zero become leader so 58:22 the problem is solved
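Here is a sketch of that vote-counting logic with the double-check added, under the same assumptions about skeleton names as before: after re-acquiring the lock we confirm that we are still a candidate and still on the term the election started with before making ourselves leader.

```go
func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	rf.state = Candidate
	rf.currentTerm++
	term := rf.currentTerm
	votes := 1 // we vote for ourselves
	rf.mu.Unlock()

	for i := range rf.peers {
		if i == rf.me {
			continue
		}
		go func(server int) {
			if !rf.callRequestVote(server, term) {
				return
			}
			rf.mu.Lock()
			defer rf.mu.Unlock()
			votes++
			if votes <= len(rf.peers)/2 {
				return // no majority yet
			}
			// The lock was released during the RPC, so anything may have
			// happened in the meantime: only become leader if we are still
			// a candidate and still on the term this election started with.
			if rf.state != Candidate || rf.currentTerm != term {
				return
			}
			rf.state = Leader
		}(i)
	}
}
```

Note that once we become leader, any later replies are harmless: rf.state is no longer Candidate, so the check simply returns.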
any questions yeah 58:28 yeah I think that would 58:30 because we would not 58:35 if the term is higher now then 58:38 actually no it might not be sufficient because we might 58:40 have attempted another election it 58:42 depends on your implementation but it's 58:44 possible that you could have attempted 58:47 another election on a higher term 58:49 afterwards for all we know 58:51 so it would not be 58:52 sufficient to only check the state but I 58:54 think you're right if you only check the 58:56 term then it is sufficient all right any 59:01 other questions all right so yeah that's 59:09 it for this part she's going to show you 59:11 some more examples of actually debugging 59:14 some of these Raft implementations 59:34 hi can you all hear me yeah 59:52 is it on 60:06 okay so in my section I'm going to walk you 60:14 through how I would debug if you have 60:17 a bug in your Raft implementation 60:20 so I prepared a couple of buggy implementations 60:24 and I'll try to walk you through them so 60:30 first I'm going to go into my first 60:33 buggy implementation and if I run the 60:41 test here for this one it doesn't 60:52 print anything it just gets stuck and 60:55 it's going to sit here forever and let's 61:00 assume that I have no idea why this is 61:02 happening 61:03 the first thing that I want to find out 61:07 is where it gets stuck and we do 61:13 have a good tool for that which is printing 61:16 in the starter code if you go to 61:21 util.go we have a function called 61:25 DPrintf this is just a nice wrapper 61:28 around the log package with a 61:32 debug flag to enable or disable the 61:36 logging messages so I'm going to enable 61:40 that and go back to my Raft code okay so 61:46 first of all when 61:50 there's something buggy happening I 61:54 always go check whether the code actually 62:02 initializes the Raft servers so here 62:08 I'll just add a print 62:20 okay so if I run the test again 62:27 now I know that there are three 62:31 servers that got initialized so this 62:36 part is okay but I still have no idea 62:43 where the bug is happening so I'll just 62:45 go deeper into the code to find 62:48 where it gets stuck so now if you look at 62:50 the code we are calling this 62:55 attempt election function so I'm going to go to that 62:58 function and just to make it faster I'll 63:06 check whether it kicks off an 63:08 election 63:21 that part is still fine so we go 63:25 further now here we are in the election and I'll 63:31 see whether we actually send the 63:37 request votes to the other servers 64:00 now we kind of have more of an idea of 64:02 where it gets stuck because it's 64:04 printing that it kicks off 64:08 the election but not that it's sending the request 64:11 votes so I would go back further just to 64:17 see where it gets stuck I always try 64:21 to print here if we call some function 64:27 I always 64:29 double-check whether it actually 64:31 goes into the function so now I'm going to 64:37 print that this server is at the start of 64:42 the election 64:50 and that works so now we have an idea that 64:56 the bug should be between here and 65:02 here so we are trying to minimize the 65:06 scope of the code that's causing the bug 65:14 let's say I print something here 65:28 and it doesn't get there so I 65:33 move it up let's say here still not 65:41 there 65:48 now it's there so the bug is probably in 65:55 this function and I just go check so 66:02 here the problem is that I'm trying to 66:05 acquire a lock that I already 66:08 hold so it's going to be a deadlock
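For reference, the shape of that bug looks roughly like this (the function names are made up for illustration): Go's sync.Mutex is not reentrant, so a method that takes the lock and then calls a helper that takes it again will block on itself forever.

```go
func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.convertToCandidate() // bug: we already hold rf.mu here
	// ...
}

func (rf *Raft) convertToCandidate() {
	rf.mu.Lock() // sync.Mutex is not reentrant, so this blocks forever
	defer rf.mu.Unlock()
	rf.state = Candidate
	rf.currentTerm++
}
```

A common convention is to decide, for each helper, whether the caller is expected to already hold the lock, and to say so in a comment next to the function.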
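For completeness, the DPrintf helper used for these prints lives in util.go in the lab starter code and looks roughly like the following (the exact contents may differ a bit from year to year); flipping the Debug constant turns all of the debug output on or off at once.

```go
package raft

import "log"

// Debugging: set Debug to 0 to silence all DPrintf output.
const Debug = 1

func DPrintf(format string, a ...interface{}) {
	if Debug > 0 {
		log.Printf(format, a...)
	}
}
```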
so 66:13 that's how I would find the first bug 66:16 using DPrintf and it's nice 66:22 to use DPrintf because you can 66:29 just turn off the debugging prints and 66:33 have a nice test output without all the 66:38 debugging if you want it so that's how I 66:43 would use DPrintf to 66:47 hunt down a bug in your code and for this 66:51 example there's actually another trick 66:54 to help you find this kind of deadlock 66:59 so if you press ctrl + backslash you can 67:04 see in the bottom left that I 67:09 pressed control and backslash this 67:13 command sends a quit signal 67:16 to the 67:17 Go program and by default it will 67:21 handle the quit signal by quitting all 67:26 the goroutines and printing out 67:29 the stack traces so now 67:41 if I kick this off here it gets stuck 67:43 and then there are going to be a couple of 67:47 functions printed here 67:55 just going through all the traces 68:07 yes so it's actually showing that the 68:11 function that's causing the problem is 68:14 the convert to candidate function so that's another 68:17 way to find out where the 68:20 deadlocks are I can remove all this 68:43 and now it works so that's the first 68:47 example that I want to go through the second 68:51 thing that you want to do 68:54 before you submit your labs is to turn 68:57 the race 68:58 flag on when you run the tests the way to 69:03 do that is just to add -race before 69:07 -run and here because my 69:18 implementation doesn't have any races 69:20 it's not going to tell you anything 69:22 but just be careful about this 69:25 because it's not a proof that you don't 69:29 have any races it's just that it didn't 69:33 detect any races for you I'm going to run 69:42 the same command again with the race flag 69:45 but now this time there's actually a race 69:48 going on in my implementation so it's 69:56 going to yell at you that there are some 70:00 data races going on in your code 70:08 I'm quitting that and let's see how 70:13 useful the warnings are so I'm going to 70:20 go to my second implementation 70:27 and here 70:37 let's look at this race so it's telling 70:45 us that there's a read going on at 70:48 line 70:49 103 I'm going to that line so the 70:54 read is probably this state here and 71:08 there's also a write at another line which 71:20 is this state so 71:38 I'm going to this line again 71:45 and now we kind of know that this 71:48 mutation is supposed to be protected by a lock so the 71:53 race flag is actually warning us and 71:56 helping us to find 72:00 the data race that we have so the fix is 72:05 going to be to just lock this and unlock 72:15 it and that should solve the problem 72:28 so at this point we kind of know how to 72:31 do some basic debugging does 72:35 anyone have any questions no okay
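Before moving on, here is a small self-contained illustration (not the lecture's code) of the kind of bug go test -race catches, and of the fix of guarding the shared field with a mutex everywhere it is touched:

```go
package main

import (
	"fmt"
	"sync"
)

type counter struct {
	mu sync.Mutex
	n  int
}

func main() {
	c := counter{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the Lock/Unlock pair, `go run -race` reports a data
			// race here: several goroutines write c.n concurrently.
			c.mu.Lock()
			c.n++
			c.mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c.n) // all writers have finished, so this read is safe
}
```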
yeah so 72:42 I'm going to go to the third one which 72:46 is going to be a more difficult bug to find 72:50 I'm going to run the 73:01 tests and now I actually have 73:04 some debugging messages in there already 73:10 and you can see that I also have a 73:17 debugging message with the test action 73:20 this is something you might want to 73:22 consider doing if you go into the test 73:34 file here 73:42 you can see how the test runs 73:46 and there are some actions that the 73:49 test file is going to take to make your code 73:52 fail and it's usually a good idea to 73:58 print out where that action is happening 74:02 among your actual debugging messages so you 74:07 can guess 74:13 where the bug is happening and in which phase of 74:18 the test if that makes sense so now 74:22 I was doing fine in the first case 74:27 I passed the first test but I'm 74:30 failing the second test and here the 74:37 test action is to elect a new leader 74:40 so I'm passing the test until 74:46 this point 74:51 I'm actually passing until leader two rejoins so 74:57 this can give you a nice idea of how the 74:59 test is working and help you 75:09 make a better guess as to where the bug 75:13 is in your code so now let's look at the 75:21 debugging messages 75:32 so it seems like when leader two 75:35 rejoined it became a follower and we 75:40 have a new leader 75:41 so that looks fine to me and we probably 75:46 need more debugging messages instead of 75:50 just the state changes so I am going 76:00 to add some more my first guess is that when 76:05 one becomes a leader it might not be 76:08 doing what a leader should do correctly 76:13 since that's where we got stuck 76:23 so in my code after we convert to leader 76:26 eventually I have a goroutine 76:30 called operate leader 76:32 that's just sending heartbeats 76:34 to all the other servers so I'm going to print 76:41 some stuff here saying that we're sending a heartbeat 76:54 to each server 77:20 so when two becomes the leader it sends the 77:25 first heartbeat to each server and one still 77:33 tries to send heartbeats to the new leader 77:41 and then one becomes a follower so this 77:46 doesn't look like the problem now 77:54 I'm going to check if the other servers 77:56 receive heartbeats correctly 79:25 it's taking a while let me try to 79:29 finish this yeah so two becomes a leader 79:37 two sends heartbeats but no one receives a 79:43 heartbeat from two so if I go to the send 79:54 append entries I actually hold the lock through 79:59 the RPC call which is the problem that 80:03 was covered in the last section so 80:07 that's the problem that I need to 80:10 fix so what I should do is unlock here 80:23 and then 80:33 lock again here and that should work 80:47 we pass and then there are a couple of things 80:53 that you might want to do when you test 80:58 your Raft implementation so there's 81:03 actually a script to run the tests in 81:09 parallel and I can show you how we 81:14 can use it this script is 81:18 on the course forum someone 81:21 made a post about it and here's how we 81:27 can use the script so you run the script 81:33 specify the number of times to run the test 81:36 personally I do like 1000 but that 81:40 depends on your preference this is 81:44 how many tests you want to run 81:47 at the same time and then here's 81:49 the test name and if you run the script 81:59 it will show you that we have run 82:04 four tests so far and all are passing 82:09 and it's going to keep going like that so 82:13 that's how I would go about debugging a 82:17 Raft implementation and you are all 82:19 welcome to come to office hours when you 82:22 need help
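As a final recap of the most common bug in these examples (holding the lock across an RPC), here is a sketch of a heartbeat loop in the style described above. The names operateLeader, sendAppendEntries, and the AppendEntriesArgs fields are placeholders for whatever your implementation uses, the 100 ms interval is only an example, and the snippet assumes the raft package imports time: shared state is snapshotted under the lock, the lock is released, and only then are the blocking RPC calls made.

```go
// Runs while we are the leader; sends heartbeats to every other peer
// without ever holding the lock across an RPC call.
func (rf *Raft) operateLeader() {
	for {
		rf.mu.Lock()
		if rf.state != Leader {
			rf.mu.Unlock()
			return
		}
		term := rf.currentTerm // snapshot shared state under the lock
		rf.mu.Unlock()

		for i := range rf.peers {
			if i == rf.me {
				continue
			}
			go func(server int) {
				args := AppendEntriesArgs{Term: term, LeaderId: rf.me}
				var reply AppendEntriesReply
				rf.sendAppendEntries(server, &args, &reply) // no lock held
			}(i)
		}
		time.Sleep(100 * time.Millisecond) // heartbeat interval (example value)
	}
}
```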