Transcript

Hey, the TAs are going to be giving a lecture on concurrency in Go. This lecture is going to be full of design patterns and practical tips to help you with the labs. We're going to briefly cover the Go memory model and the reading we assigned, then spend most of the lecture talking about concurrency primitives in Go and Go concurrency patterns, how you do the things you'll need to do in the labs, and finally we'll talk through some debugging tips and techniques and show you some tools you might want to use when debugging the labs.

Very briefly, on the Go memory model reading: why did we assign it? The goal was to give you some concrete examples of correct ways to write threaded code in Go. The second half of the document has examples of correct and incorrect code and how things can go wrong. One thing you might have noticed is that early on it says if you need to read and understand this document, you're being too clever, and we think that's good advice. Focus on how to write correct code; don't focus too much on the happens-before relation and being able to reason about exactly why incorrect code is incorrect. We don't really care; we just want to be able to write correct code and call it a day.

One question that came up in the lecture questions was about goroutines in relation to performance. Goroutines, and concurrency in general, can be used for a couple of different reasons, and the reason we use concurrency in the labs is not performance. We're not going for parallelism, using multiple cores on a single machine to do more work on the CPU. Concurrency gets us something else besides performance through parallelism: it gets us better expressivity. We want to write down some ideas, and it happens that code using threads is a clean way of expressing those ideas. The takeaway is that when you use threads in lab 2 and beyond, don't try to do the fancy things you might do if you were going for performance, especially CPU performance. We don't care about things like fine-grained locking or similar techniques. Write code that's easy to reason about, use big locks to protect large critical sections, and just don't worry about performance in the sense of CPU performance.

With that, that's all we're going to say about the memory model; we'll spend most of this lecture just talking about Go code and Go concurrency patterns. As we go through these examples, feel free to ask questions about what's on the screen or anything else you're thinking about.

I'm going to start off talking about concurrency primitives in Go. The first thing is closures. This is something that will almost certainly be helpful in the labs, and it's related to goroutines. Here's an example program on the screen: the main function declares a couple of variables and then spawns a goroutine.
The goroutine spawned by this go statement isn't a call to some function defined elsewhere; it's this anonymous function defined inline. This is a handy pattern called a closure, and one neat thing about it is that the function defined here can refer to variables from the enclosing scope. For example, it can mutate the variable a that's defined up here, or refer to the wait group defined up here. If we go run this example, it does what you'd think: the wg.Done() here lets the main thread continue past the wg.Wait(), and it prints out the variable, which has been mutated by the concurrently running goroutine that finished before the wait completed. So this is a useful pattern to be able to use.

The reason we're pointing this out is that you might have code that looks like this in your labs. It's very similar to the previous example, except this code spawns a bunch of threads in a loop. That's useful, for example, when you want to send RPCs in parallel. In lab 2, if a candidate is asking for votes, you want to ask all the followers in parallel, not one after the other, because an RPC is a blocking operation that might take some time. Similarly, the leader might want to send AppendEntries to all the followers, and you want to do that in parallel, not in series. Threads are a clean way to express this idea, so you might have code that looks like this at a high level: in a for loop, you spawn a bunch of goroutines.

One thing to be careful about here, and this was talked about in a previous lecture, is capture of the loop variable by the goroutine while the outer scope keeps mutating it. We have this i that's being mutated by the for loop, and we want to use its value inside the goroutine. The correct way of writing this code is to pass the value of i as an argument to the function; you can rename it to x inside and then use that value. If we run this program, where I've stubbed out the "send RPC" part so it just prints the index (this i might be the index of the follower you're sending an RPC to), it prints the numbers 0 through 4 in some order. That's what we want: send RPCs to all the followers.

The reason we're showing you this code is that there's a variation that looks really similar, and intuitively you might think it does the right thing, but it doesn't. In this version the only thing that's changed is that we've gotten rid of the argument we were explicitly passing, and instead we let this i refer to the i from the outer scope. You might think that when you run this it does the same thing, but in this particular run it printed 4 5 5 5 5, so it does the wrong thing.
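Here's a minimal sketch of the two variants just described. The sendRPC helper and the use of a sync.WaitGroup are my additions to make the sketch self-contained; the slide's code may differ in the details.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		// Correct: pass the loop variable as an argument so each
		// goroutine gets its own copy of the current value of i.
		go func(x int) {
			defer wg.Done()
			sendRPC(x)
		}(i)

		// Incorrect variant (don't do this): the closure captures the
		// shared loop variable i, which the loop keeps mutating, so the
		// goroutines may all see late values such as 5.
		//
		// go func() {
		//     defer wg.Done()
		//     sendRPC(i)
		// }()
	}
	wg.Wait()
}

// sendRPC stands in for the real work, e.g. sending a RequestVote RPC
// to peer number x; here it just prints the index.
func sendRPC(x int) {
	fmt.Println(x)
}
```

As an aside, if you're on Go 1.22 or newer, the loop variable is scoped per iteration, so the capturing version happens to work too; passing the value explicitly is still the clearer habit, and it's what this lecture assumes.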
The reason the second version misbehaves is that this i is being mutated by the outer scope, and by the time the goroutine actually executes this line, the for loop has already changed the value of i, so it doesn't do the right thing. At a high level, if you're spawning goroutines in a loop, just make sure you use the pattern of passing the loop variable as an argument and everything will work. Any questions about that? It's a small gotcha, but we've seen it a whole bunch of times in office hours, so I wanted to point it out.

All right, moving on to other patterns you might want to use in your code. Oftentimes you want code that periodically does something. A very simple way to do that is to have a separate function that does something in an infinite loop, in this case just printing "tick", and then uses time.Sleep to wait for a certain amount of time. Very simple pattern; you don't need anything fancier than this to do something periodically.

One modification of this that you might want is to do something periodically until something happens. For example, you might start up a Raft peer and periodically send heartbeats, but when Kill() is called on the Raft instance, you want to actually shut down these goroutines so you don't have random goroutines still running in the background. The pattern for that looks something like this: you have a goroutine that runs in an infinite loop, does something, waits a little bit, and then you have a shared variable between it and whatever control thread decides whether this goroutine should die. In this example we have a global variable done, and what main does is wait for a while and then set done to true. In the goroutine that's ticking and doing work periodically, we just check the value of done, and if it's set, we terminate the goroutine. Since done is a shared variable being mutated and read by multiple threads, we need to guard its use with a lock; that's where the mu.Lock() and mu.Unlock() come in. For the purposes of the labs you can actually write something a little simpler than this: you have the method rf.killed() on your Raft instance, so you might have code that looks more like "while my Raft instance is not dead, periodically do some work". Any questions about that so far? Yeah, question.

Does using the locks, rather than channels, make it so that any writes to variables in those functions are guaranteed to be observed by the other thread, or would you need to send done across a channel? Okay, let me try to simplify the question a bit. I think the question is: do you need to use locks here, can you use channels instead, and can you get away with using nothing at all? What's the difference between nothing versus channels versus locks? I think the question is about this done variable: does it need to be sent across a channel, or does just using these locks ensure that the read here observes the write done by the other thread?
Okay, so the answer is yes. At a high level, if you want to ensure cross-thread communication, make sure you use Go synchronization primitives, whether that's channels or locks and condition variables. Here, because of the use of locks, after this thread writes done and calls Unlock(), the next Lock() that happens is guaranteed to observe the writes that happened before that Unlock(). So this write happens, this unlock happens, then one of these locks happens, and the next read of done is guaranteed to observe the write of true. Question?

That's a good question. In this particular code it doesn't matter, but it would be cleaner to do it. The question is why we don't call mu.Unlock() here before returning, and the answer is that at this point the program is done, so it doesn't actually end up mattering, but you're right that in general we would want to ensure that we unlock before we return. Thanks for pointing that out.

I'm not entirely sure what the question is, but maybe it's something like: can both of these acquire the lock at the same time? We'll talk a little more about locks in just a moment, but at a high level the semantics of a lock are that it's either held by somebody or not. If it's not held and someone calls Lock(), they have the chance to acquire it, and if somebody else calls Lock() before the holder calls Unlock(), that other thread is blocked until the Unlock() happens and the lock is free again. So between the Lock() and the Unlock(), for any particular lock, only a single thread can be executing what's called the critical section. Any other questions?

The question is related to timing: when you set done = true and then unlock, you have no guarantee in terms of real time about when periodic will end up being scheduled, observe that write, and actually terminate. Yes, if you wanted to ensure that periodic has actually exited for some particular reason, you could write code that communicates back from periodic acknowledging this. But in this particular case, the only reason we have the sleep here is to demonstrate that tick prints for a while and that periodic is indeed cancelled, because it stops printing before I get my shell prompt back. In general, for a lot of these background threads, you can just tell them to die, and it doesn't matter whether they're killed within one second or two seconds or exactly when Go schedules them, because the thread will just observe the write to done, exit, and do no more work. Another thing about Go is that if you spawn a bunch of goroutines, one of them is the main goroutine, this one here, and the way Go works is that if the main goroutine exits, the whole program terminates and all goroutines are terminated.
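For reference, here's a minimal sketch of the periodic-with-shutdown example being discussed; the variable and function names are my guesses at what's on the slide, and the sleep durations are arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu   sync.Mutex
	done bool
)

func main() {
	fmt.Println("started")
	go periodic()
	time.Sleep(5 * time.Second) // let it tick for a while
	mu.Lock()
	done = true
	mu.Unlock()
	fmt.Println("cancelled")
	time.Sleep(3 * time.Second) // observe that the ticking has stopped
}

func periodic() {
	for {
		fmt.Println("tick")
		time.Sleep(1 * time.Second)
		mu.Lock()
		if done {
			mu.Unlock()
			return
		}
		mu.Unlock()
	}
}
```

In the labs, the simpler form mentioned above usually replaces the hand-rolled done flag: the loop condition becomes something like "for !rf.killed()".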
That's a great question. I think the question is: why do you need locks at all here? Can you just delete them? Looking at this code, main does a write of true at some point and periodic is repeatedly reading it, so at some point it should observe that write, right? Well, it turns out this is exactly why Go has this fancy memory model and the whole happens-before relation. The compiler is allowed to take this code and emit low-level machine code that does something a little different from what you intuitively thought would happen. We can talk about that in detail offline, after the lecture or in office hours, but at a high level, one rule you can follow is: if you have accesses to shared variables and you want them to be observed across different threads, you need to be holding a lock when you read or write those shared variables. In this particular case, the Go compiler would be allowed to lift the read of done outside the for loop, so it reads the shared variable once, and if done is false, it turns the inside into an infinite loop. The way this thread is written, it uses no synchronization primitives, no mutex lock or unlock, no channel sends or receives, so it's not guaranteed to observe any mutations done by other concurrently running threads. If you look on Piazza, I've actually posted a particular Go program that gets optimized in this unintuitive way: it produces code that loops forever even though, looking at it, you might think the obvious way to compile it would produce something that terminates. So the memory model is pretty fancy, and it's really hard to think about why exactly incorrect programs are incorrect, but if you follow some general rules, like holding locks before you mutate shared variables, you can avoid thinking about these nasty issues. Any other questions?

All right, let's talk a little more about mutexes. Why do you need a mutex? At a high level, whenever you have concurrent access by different threads to some shared data, you want to ensure that reads and writes of that data are atomic. Here's an example program that declares a counter and then spawns a thousand goroutines that each read the counter value and increment it by one. Looking at this, you might think that when I print the value of the counter at the end, it should print a thousand, but it turns out we missed some of the updates; in this particular run it only printed 947. What's going on is that the update isn't protected in any way, so threads running concurrently can read the value of counter, update it, and clobber other threads' updates. Basically, we want this entire read-modify-write to happen atomically.
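Here's a sketch of the racy counter just described, with the mutex-based fix that the next paragraph walks through shown alongside; the one-second sleep is the same simplification the lecture uses in place of a wait group.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Racy version: 1000 goroutines all do an unsynchronized
	// read-modify-write of counter, so updates can be lost and the
	// final value is often less than 1000 (e.g. 947).
	counter := 0
	for i := 0; i < 1000; i++ {
		go func() {
			counter = counter + 1
		}()
	}
	time.Sleep(1 * time.Second)
	fmt.Println("racy counter:", counter)

	// Fixed version: the same loop, but the increment is a critical
	// section protected by a mutex, so no updates are lost.
	var mu sync.Mutex
	counter2 := 0
	for i := 0; i < 1000; i++ {
		go func() {
			mu.Lock()
			defer mu.Unlock()
			counter2 = counter2 + 1
		}()
	}
	time.Sleep(1 * time.Second)
	fmt.Println("locked counter:", counter2)
}
```

Running this under the race detector (go run -race) will flag the first loop, which is exactly the point of the demo.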
The way you make a block of code run atomically is by using locks. In this code example we've fixed the bug: we create a lock, and all the goroutines that modify the counter first grab the lock, then update the counter value, then unlock. You'll see we're using the defer keyword here. What it does is basically the same as putting the unlock down at the bottom: we grab the lock, do the update, then unlock. defer is just a nice way of remembering to do this, because you might forget to write the unlock later; you can think of it as scheduling the unlock to run at the end of the current function body. This is a really common pattern you'll see, for example, in your RPC handlers for the lab. RPC handlers often read or write data on the Raft structure, and those updates should be synchronized with other concurrently happening updates, so the pattern for an RPC handler is often: grab the lock, defer the unlock, and then do the work inside. If we run this code, it produces the expected result: it prints a thousand and we haven't lost any updates. So at a high level, what a lock or mutex does is guarantee mutual exclusion for a region of code, which we call a critical section. In here, this is the critical section, and the lock ensures that none of these critical sections execute concurrently with one another; they're all serialized, happening one after another. Question?

Yes, that's a good observation: this particular code is actually not guaranteed to print a thousand, depending on how thread scheduling ends up happening, because all the main goroutine does is wait for one second, which is an arbitrary amount of time, and then print the value of the counter. I just wanted to keep this example as simple as possible. A different way to write this code that would be guaranteed to print a thousand would be to have the main goroutine wait for all thousand goroutines to finish; you could do that using a wait group, for example, but we didn't want to put two synchronization primitives, wait groups and mutexes, in the same example. That's why this code is technically incorrect, but I think it still demonstrates the point of locks. Any other questions?

Great. So at a very high level you can think of locks as: you grab the lock, you mutate the shared data, and then you unlock. Does this pattern always work? It turns out that's a useful starting point for how to think about locks, but it's not really the complete story. Here's some code that doesn't fit on the screen, but I'll explain it as we scroll through it. It basically implements a bank. I have Alice and Bob, who both start out with some balance, and I keep track of the total amount of money stored in my bank. Then I spawn two goroutines that transfer money back and forth between Alice and Bob. This one goroutine, a thousand times, will deduct one from Alice and add it to Bob.
Concurrently, I have this other goroutine that in a loop deducts one from Bob and adds it to Alice. Notice that I have this mutex here, and whenever I manipulate these variables that are shared between the two threads, I'm always locking the mutex; the update only happens while the lock is held. So is this code correct or incorrect? There actually isn't a straightforward answer to that question; it depends on the semantics of my bank, on what behavior I expect. So I'm going to introduce another thread, the audit thread. Every once in a while it checks the sum of all the accounts in my bank and makes sure that sum is the same as what it started out as, because if I only allow transfers within my bank, the total amount should never change. What this thread does is grab the lock, sum up Alice plus Bob, compare that to the total, and if it doesn't match, print that it observed a violation: the total is no longer what it should be. If I run this code, I see that a whole bunch of times this concurrently running thread does indeed observe that Alice plus Bob is not equal to the expected total. So what went wrong? We were following our basic rule of grabbing a lock whenever we access data shared between threads, and it is indeed true that no updates to these shared variables happen while the lock is not held. Exactly; let me repeat that for everybody to hear. What we intended was for this decrement and increment to happen atomically, but what we ended up writing was code that decrements atomically and then increments atomically. In this particular code we actually won't lose money in the long term: if we let these threads run, wait for them to finish, and then check the total, it will indeed be what it started out as. But while they're running, since this entire block of code is not atomic, we can temporarily observe these violations. So at a higher level, the way to think about locking is not just that locks protect access to shared data; locks are meant to protect invariants. You have some shared data that multiple threads might access, and there are properties that hold on that shared data. For example, here I as the programmer decided that I want the property that Alice plus Bob equals some constant, and that should always be the case. But different threads running concurrently make changes to this data and might temporarily break the invariant: when I decrement Alice, the sum Alice plus Bob temporarily changes, but then the thread eventually restores the invariant. So locks are meant to protect invariants. At a high level, you grab a lock, you do some work that might temporarily break the invariant, but you restore the invariant before you release the lock.
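Here's a sketch of the corrected transfer described next, where the lock is held across both halves of each transfer so the audit goroutine can never observe the invariant (alice + bob == total) broken; the variable names and amounts are my guesses at the slide's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	alice, bob := 5000, 5000
	total := alice + bob
	var mu sync.Mutex

	// Transfer from Alice to Bob, holding the lock across BOTH updates,
	// so the invariant is restored before anyone else can look at the
	// balances. (The buggy version locked and unlocked around each
	// update separately.)
	go func() {
		for i := 0; i < 1000; i++ {
			mu.Lock()
			alice -= 1
			bob += 1
			mu.Unlock()
		}
	}()
	go func() {
		for i := 0; i < 1000; i++ {
			mu.Lock()
			bob -= 1
			alice += 1
			mu.Unlock()
		}
	}()

	// Audit thread: checks the invariant, also under the lock.
	go func() {
		for {
			mu.Lock()
			if alice+bob != total {
				fmt.Printf("observed violation: alice=%d bob=%d sum=%d\n",
					alice, bob, alice+bob)
			}
			mu.Unlock()
		}
	}()

	time.Sleep(1 * time.Second)
}
```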
That way nobody can observe these in-progress updates. So the correct way to write this code is to actually use fewer calls to lock and unlock: we lock, then do all the work, and then unlock. When you run this code, we see no more printouts like before; the audit thread never observes that the total is not what it should be. So that's the right way to think about locking. At a high level you can think of it as: make sure you grab locks whenever you access shared data, that's one rule, but another important rule is that locks protect invariants. You grab a lock, manipulate things in a way that might temporarily break the invariants, restore them, and then release the lock. Another way to think about it is that locks make entire regions of code atomic, not just single statements or single updates to shared variables. Any questions about that?

Great. The next synchronization primitive we're going to talk about is condition variables. It seems like there's been a source of confusion since lab 1, where we mentioned condition variables but didn't quite explain them, so we're going to take the time to explain them now, in the context of an example you should all be familiar with: counting votes. Remember, in lab 2A you have this pattern where whenever a Raft peer becomes a candidate, it wants to send out vote requests to all of its followers, and eventually the followers come back to the candidate and say yes or no, whether or not the candidate got the vote. One way we could write this code is to have the candidate ask peer number one, then peer number two, then peer number three, in series, but that's bad, because we want the candidate to ask all the peers in parallel so it can quickly win the election when possible. And there are other complexities: when we ask all the peers in parallel, we don't want to wait for a response from all of them before making up our mind, because if a candidate gets a majority of votes, it doesn't need to wait until it hears back from everybody else. So this code is kind of complicated in some ways. Here's a stubbed-out version of what that vote counting code might look like, with a little bit of infrastructure to make it actually run. The main goroutine sets count, the number of yes votes I've gotten, to zero, and finished, the total number of responses I've gotten, to zero. The idea is that I want to send out vote requests in parallel and keep track of how many yeses and how many responses I've gotten, and once I know whether I've won the election or whether I've lost it, I can act on that and move on; in the real Raft code you'd do whatever you need to do, step up to leader or step down to follower, once you have the result. So looking at this code, say I have ten peers: in parallel I spawn ten goroutines,
passing in this closure. Each goroutine requests a vote, and if it gets the vote it increments count by one, and either way it increments finished by one; count is the number of yeses, finished is the total number of responses. Outside, in the main goroutine, I'm waiting for a condition to become true: either I have enough yes votes to have won the election, or I've heard back from enough peers to know that I've lost. So in a loop I check and wait until count is greater than or equal to five or finished is equal to ten, and once that's the case, I can determine whether I've won or lost. Does anybody see any problems with this code, given what we just talked about with mutexes? Yes, exactly: count and finished aren't protected by a mutex. So one thing we certainly need to fix is that whenever we have shared variables, we need to protect access to them with a mutex. That's not too bad to fix here: I declare a mutex that's accessible by everybody, and then in the goroutines I'm launching in parallel to request votes, and this pattern here is pretty important, I first request the vote while not holding the lock, and only afterwards grab the lock and update these shared variables. Outside, I have the same pattern as before, except I make sure to lock and unlock around reading these shared variables: in an infinite loop I grab the lock, check whether the result of the election has been determined by this point, and if not I keep going around the loop; otherwise I unlock and then do what I need to do outside. If I run this example, it seems to work, and this is actually a correct implementation, it does the right thing, but there are some problems with it. Can anybody spot them? I'll give you a hint: this code is not as nice as it could be. Not quite; it will wait for exactly the right amount of time. The issue is that it's busy waiting. In a very tight loop it grabs the lock, checks the condition, unlocks, grabs the lock, checks the condition, unlocks, and it's going to burn 100% of the CPU on one core while doing this. So this code is correct, but, even though at a high level we don't care about CPU efficiency for the purposes of the labs, if you're using a hundred percent of one core, you might actually slow down the rest of your program enough that it won't make progress. That's why this pattern is bad: we're burning up a hundred percent of a core waiting for some condition to become true. Does anybody have ideas for how we could fix this? Here's one simple solution where I change a single line of code: all I've added here is a sleep of 50 milliseconds, and this is a correct transformation of the program.
And it kind of seems to solve the problem: before, I was burning a hundred percent of a core, and now, only once every 50 milliseconds, I briefly wake up, check the condition, and go back to sleep if it doesn't hold. So this is basically a working solution. Any questions? This kind of sort of works, but one thing you should always be aware of whenever you write code is magic constants. Why 50 milliseconds? Why not a different number? Whenever you have an arbitrary number in your code, it's a sign that you're doing something that's not quite right, or not as clean as it could be. It turns out there's a concurrency primitive designed to solve exactly this problem: I have some threads running concurrently that are making updates to some shared data, and I have another thread that's waiting for some condition, some property of that shared data, to become true, and until it becomes true the thread just waits. The tool designed exactly for this is called a condition variable.

The way you use a condition variable, the pattern basically looks like this. We have our lock from earlier; condition variables are associated with locks, so we have some shared data, a lock that protects that shared data, and then a condition variable that's given a pointer to the lock when it's initialized. We're going to use the condition variable to coordinate around a certain condition, some property of that shared data, becoming true. We modify our code in two places: the place where we make changes to the data, which might make the condition become true, and the place where we're waiting for the condition to become true. The general pattern is: whenever we do something that changes the data, we call cond.Broadcast(), and we do this while holding the lock; and on the other side, where we're waiting for the condition on that shared data to become true, we call cond.Wait(). Let's think about what happens in the main thread for a moment. The main thread grabs the lock and checks the condition; suppose it's false, so it calls cond.Wait(). You can think of what this does as atomically releasing the lock, in order to let other threads make progress, and adding itself to a list of threads waiting on this condition variable. Then, concurrently, one of these vote-requesting threads might acquire the lock after it's gotten a vote, manipulate the shared variables, and call cond.Broadcast(), which wakes up whoever is waiting on the condition variable. Once that thread unlocks the mutex, the waiting thread, as it returns from Wait(), reacquires the mutex and goes back to the top of its for loop, which rechecks the condition. So the Broadcast wakes up whoever is parked at the Wait, and this avoids having to call time.Sleep for some arbitrary amount of time.
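Here's a sketch of the vote-counting example rewritten with a condition variable, matching the pattern just described; requestVote is a stub standing in for the real RPC, and the exact thresholds (5 of 10) follow the lecture's setup.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	count := 0    // yes votes
	finished := 0 // total responses
	var mu sync.Mutex
	cond := sync.NewCond(&mu)

	for i := 0; i < 10; i++ {
		go func() {
			vote := requestVote() // do the slow RPC without the lock
			mu.Lock()
			defer mu.Unlock()
			if vote {
				count++
			}
			finished++
			// We changed the data the waiter cares about, so wake it up.
			cond.Broadcast()
		}()
	}

	mu.Lock()
	// Instead of busy-polling or sleeping for a magic 50ms, wait until
	// the election outcome is known.
	for count < 5 && finished != 10 {
		cond.Wait() // atomically releases mu; reacquires it on wakeup
	}
	if count >= 5 {
		fmt.Println("received a majority of votes")
	} else {
		fmt.Println("lost the election")
	}
	mu.Unlock()
}

func requestVote() bool {
	time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	return rand.Int()%2 == 0
}
```

Note that the Broadcast happens while the lock is held, and the condition is always rechecked in a loop around Wait; that's the discipline described next.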
The thread that's waiting for some condition to become true only gets woken up when something changes that might make that condition become true. If you think about these vote-requesting threads, if they're very slow and don't call cond.Broadcast() for a long time, this waiter will just wait; it won't be periodically waking up to check a condition that can't have changed, because nobody else has touched the shared data. Any questions about this pattern? Yeah.

That's a great question; I think you're referring to something called the lost wakeup problem. This is a topic in operating systems and we won't talk about it in detail now, but feel free to ask me after lecture. At a high level, you can avoid the funny race conditions that might happen between Wait and Broadcast by following the particular pattern I'm showing here, and I'll show you an abstracted version of it in a moment. Basically, on the side that makes changes that can affect the outcome of the condition test, you always lock, then manipulate the data, then call Broadcast, and call Unlock afterwards; the Broadcast must be called while holding the lock. Similarly, when you're checking the condition, you grab the lock and always check the condition in a loop, and inside, when the condition is false, you call cond.Wait(). Wait is only called while you're holding the lock, and it atomically releases the lock and puts the thread on a list of waiters; then, as Wait returns and you go back to the top of the loop, it reacquires the lock, so the check only ever happens while holding the lock. Once you're past the loop you still hold the lock, and you unlock after you're done doing whatever you need to do.

At a high level the pattern looks like this: we have one thread, or some number of threads, doing something that might affect the condition, and they grab the lock, do the thing, call Broadcast, then unlock. On the other side we have some thread waiting for the condition to become true, and the pattern there is: grab the lock, then in a loop, while the condition is false, wait. When we get past that loop, we know the condition is true and we're holding the lock, so we can do whatever we need to do, and finally we unlock. We can talk after lecture about all the things that might go wrong if you violate one of these rules, if you're interested, but at a high level, if you follow this pattern, you won't need to deal with those issues. Any questions about that? Yeah.

That's a great question: when do you use Broadcast versus Signal? Condition variables have three methods on them: one is Wait, for the waiting side, and on the other side you can use Signal or Broadcast. The semantics are that Signal wakes up exactly one thread that may be waiting, whereas Broadcast wakes up everybody who is waiting.
They'll all wake up, try to grab the lock, and recheck the condition, and only one of them will proceed at a time, because only one of them can hold the lock until it gets past that point. For the purposes of this class, always use Broadcast and never use Signal. If you follow this pattern and just always use Broadcast, your code will work. You can think of Signal as something used for efficiency, and we don't really care about that level of CPU efficiency in the labs for this class. Any more questions?

Okay, the final topic we're going to cover in terms of Go concurrency primitives is channels. At a high level, channels are a queue-like synchronization primitive, but they don't behave quite like queues in the intuitive sense. I think some people think of channels as a data structure you can stick things into, and eventually someone will pull those things out, but in fact channels have no queuing capacity, no internal storage. Channels are basically synchronous. If you have two goroutines that are going to send and receive on a channel, and someone tries to send on the channel while nobody is receiving, that thread will block until somebody is ready to receive, and at that point the data is handed over to the receiver synchronously. The same is true in the other direction: if someone tries to receive from a channel while nobody is sending, that receive will block until another goroutine is about to send on the channel, and the send happens synchronously. Here's a little demo program that demonstrates this. I declare a channel and spawn a goroutine that waits for a second and then receives from the channel. In my main goroutine I record the time, then I send on the channel, just putting some dummy data into it, and then I print out how long the send took. If you think of channels as queues with internal storage capacity, you might expect this to complete very fast, but that's not how channels work: the send is going to block until the receive happens, and that doesn't happen until the one second has elapsed, so we're actually blocked in the main goroutine for one whole second. So don't think of channels as queues; think of them as a synchronous communication mechanism.

Another example that makes this really obvious: here we have a goroutine that creates a channel, sends on the channel, and then tries receiving from it. Does anybody know what will happen when I try running this? I think the file name might give it away. Yeah, exactly: the send is going to block until somebody is ready to receive, but there is no receiver. Go actually detects this condition: if all your goroutines are blocked, it detects that as a deadlock and crashes. But you can have more subtle bugs where some other thread is off doing something.
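Here's a sketch of the timing demo just described: the send blocks for about a second because the receiver only shows up after the sleep.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	c := make(chan bool) // unbuffered: no internal storage
	go func() {
		time.Sleep(1 * time.Second)
		<-c // the receiver only arrives after one second
	}()
	start := time.Now()
	c <- true // blocks until the receiver above is ready
	fmt.Printf("send took %v\n", time.Since(start)) // roughly one second
}
```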
For example, if I spawn another goroutine that does nothing but run an empty for loop, and I try running this program again, now Go's deadlock detector won't notice that no useful work is being done; there is one thread running, it's just that this send never has a receiver. We can tell by looking at this program that it will never terminate, but when you run it, it just looks like it hangs. So if you're not careful with channels, you can get these subtle bugs where you end up with a deadlock. Yeah, exactly: there's no data, nobody is sending on this channel, so it's going to block here and never get to this line. And right, as you pointed out, channels can't really be used within just a single goroutine; it doesn't really make sense, because in order to send, or in order to receive, there has to be another goroutine doing the opposite action at the same time, and if there isn't, you're just going to block forever and that thread will no longer do any useful work. Sends wait for receives, receives wait for sends, and the exchange happens synchronously once both a sender and a receiver are present.

What I've talked about so far is unbuffered channels. I was going to avoid talking about buffered channels, because there are very few problems they're actually useful for solving. Buffered channels take a capacity; here's a buffered channel with a capacity of one, and this program does terminate, because buffered channels have some internal storage space, and until that space fills up, sends are non-blocking: they just put the data in the internal storage. But once the channel does fill up, it behaves like an unbuffered channel, in the sense that further sends will block until there's a receive to make space in the channel. At a high level, I think we should avoid buffered channels because they basically don't solve any problems, and another thing to keep in mind is that whenever you have to make up arbitrary numbers, like this capacity here, to make your code work, you're probably doing something wrong. Yeah, I think this is a question about terminology: what exactly does deadlock mean, and does this count as a deadlock? Yes, this counts as a deadlock: no useful progress will be made here, these threads are just stuck forever. Any other questions?

So what are channels useful for? I think channels are useful for a small set of things, for example producer-consumer queue sorts of situations. Here I have a program that makes a channel and spawns a bunch of goroutines that are going to be doing some work, say computing some result and producing some data. I have a bunch of these goroutines running in parallel, and I want to collect all that data as it comes in and do something with it. The doWork function just waits for a bit and produces a random number, and in the main goroutine I continuously receive on the channel and print the values out. This is a great use of channels.
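Here's a sketch of that producer-consumer use; doWork is a stand-in that sleeps and returns a random number, and the worker count is arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	c := make(chan int)
	// Producers: several goroutines compute results and send them.
	for i := 0; i < 4; i++ {
		go doWork(c)
	}
	// Consumer: the main goroutine collects results as they arrive.
	for {
		v := <-c
		fmt.Println(v)
	}
}

func doWork(c chan int) {
	for {
		time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
		c <- rand.Int()
	}
}
```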
Another good use of channels is to achieve something similar to what wait groups do. Suppose that, rather than using a wait group, I want to spawn a bunch of threads and wait until they're all done doing something. One way to do that is to create a channel, then spawn a known number of threads, five goroutines here, each of which does something and then sends on the channel when it's done, and then in the main goroutine I just receive from that channel the same number of times. This has the same effect as a wait group. So the question is: could you use a buffered channel with a capacity of five here, since you're waiting for five receives? I think in this particular case, yes, that would have an equivalent effect, but there's not really a reason to do it, and at a high level, in your code you should avoid buffered channels, and maybe even channels in general, unless you think very hard about what you're doing.

So what is a wait group? I think we covered this in a previous lecture, and I talked about it very briefly today, but I do have an example. A wait group is yet another synchronization primitive provided by Go in the sync package, and it does what its name advertises: it lets you wait for a certain number of threads to be done. The way it works is that you call wg.Add, which increments an internal counter, and then when you call wg.Wait, it waits until Done has been called as many times as Add was called. This code is basically the same as the code I just showed you using a channel; they have the exact same effect and you can use either one. The question here is about race conditions, something like: what happens if the Add doesn't happen before the Wait? Notice that the pattern here is that we call wg.Add outside the goroutine, before spawning it, so the Add happens first and the goroutine is spawned next; we'll never have the situation where the Done or the Wait runs before the Add has happened for that particular goroutine. How is this implemented by the compiler and runtime? I won't talk about that now; ask me after class or in office hours, but for the purposes of this class you need to know the API for these things, not the implementation. And that's basically all I have on Go concurrency primitives.
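Here's a sketch of both variants just described, the channel-counting version and the sync.WaitGroup version; the worker bodies are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Channel version: each worker sends once when it's finished, and
	// main receives exactly as many times as it spawned workers.
	done := make(chan bool)
	for i := 0; i < 5; i++ {
		go func(x int) {
			fmt.Println("worker", x, "done")
			done <- true
		}(i)
	}
	for i := 0; i < 5; i++ {
		<-done
	}

	// WaitGroup version: Add before spawning, Done inside the worker,
	// Wait in main. Same effect as the channel version above.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(x int) {
			defer wg.Done()
			fmt.Println("worker", x, "done (wg)")
		}(i)
	}
	wg.Wait()
}
```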
One final thought on channels: channels are good for a specific set of things, like the producer-consumer queue I just showed you, or implementing something like wait groups, but when you try to do fancier things with them, for example kicking another goroutine that may or may not be waiting to be woken up, that's a tricky thing to do with channels, and there are a bunch of other ways to shoot yourself in the foot with them. I'm going to avoid showing you examples of bad code with channels, just because it's not that useful to see, but I personally avoid using channels for the most part and just use shared memory, mutexes, and condition variables instead, and I personally find those much easier to reason about. So feel free to use channels when they make sense, but if anything looks especially awkward to do with channels, just use mutexes and condition variables; they're probably a better tool. Yeah, so the question is: what's the difference between this producer-consumer pattern and a thread-safe FIFO? I think they're kind of equivalent; you could do this with a thread-safe FIFO, and that is roughly what a buffered channel is. If you're enqueueing and dequeueing things, and you want the send to finish so this thread can go do something else while the data sits in a queue, rather than the goroutine waiting to hand it off, then a buffered channel might make sense, but at least in the labs you won't have a pattern like that. All right, next Fabian is going to talk about more Raft-related stuff.

All right, can you all hear me? Is this working? Yeah. All right, so basically I'm going to show you two bugs that we commonly see in people's Raft implementations. There are a lot of bugs that are pretty common, but I'm just going to focus on two of them. In this first example we have the start of a Raft implementation, sort of the beginnings of what you might write for 2A. In our Raft state we have, primarily, the current status of the Raft peer, either follower, candidate, or leader, and we have these two state variables keeping track of the current term and who we voted for in the current term. I want us to focus, though, on these two functions, attemptElection and callRequestVote. In attemptElection we're just going to set our state to candidate, increment our current term, vote for ourselves, and then start sending out RequestVotes to all of our Raft peers. This is similar to some of the patterns that Anish showed: we loop through our peers, and for each one, in a separate goroutine, we call this callRequestVote function to actually send an RPC to that peer. In callRequestVote we acquire the lock, prepare the arguments for our RequestVote RPC call by setting them based on the current term, then actually perform the RPC call over here, and finally, based on the response, we report back to the attemptElection function, which eventually should tally up the votes to see if it got a majority and can become leader. So what happens when we run this code? In theory, what we might expect is that there's some code that spawns a few Raft peers and tries to attempt elections on them, and what should happen is that we just start collecting votes from the other peers; we're not actually going to tally them up, but hopefully nothing
weird goes wrong. But actually something is going to go wrong here: we activated Go's deadlock detector, and somehow we ran into a deadlock. Let's see what happened. For now, let's focus on what's going on with server 0. It says it starts attempting an election at term 1; that's just the start of the attemptElection function. It acquires the lock, sets some things up for performing the election, and then unlocks. Then it sends a RequestVote RPC to server 2 and finishes processing that RPC (we're just printing right before and after we actually send the RPC), and then it sends a RequestVote RPC to server 1, but after that we never see it finish sending that RequestVote RPC. It's actually stuck in that function call, waiting for the RPC response from server 1. Now let's look at what server 1 is doing; it's pretty much the same thing. It sends a RequestVote RPC to server 2, that succeeds, it finishes processing the response from server 2, and then it sends its RPC to server 0. So what's actually happening is that 0 and 1 are waiting for the RPC responses from each other: they've both sent out an RPC call but haven't gotten the response yet, and that's the cause of our deadlock.

The real reason we're deadlocking is that we're holding the lock across our RPC calls. Over here in callRequestVote, we acquire the mutex associated with our Raft peer, and we only unlock at the end of the function, so throughout this entire function we're holding the lock, including while we try to contact our peer to get the vote. Later, when we handle an incoming RequestVote RPC, at the beginning of the handler we also try to acquire the lock, but we never succeed in acquiring it. To make this a little clearer, the order of operations is: in callRequestVote, server 0 first acquires its lock and sends an RPC call to server 1, and simultaneously and separately, server 1 does the same thing: it enters its callRequestVote function, acquires its lock, and sends an RPC call to server 0. Now server 0's handler and server 1's handler are each trying to acquire their peer's lock, but they can't, because each peer already holds its own lock while trying to send the RPC call to the other, and that's what leads to the deadlock. To solve this, basically, we want you to not hold locks through RPC calls. In fact, we don't need the lock here at all: instead of reading the current term when we enter callRequestVote, we can pass it as an argument, saving the term while we held the lock earlier in attemptElection and passing it in as a variable. That actually removes the need to acquire the lock at all in callRequestVote.
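Here's a sketch of that fix: the term is read once, under the lock, in attemptElection and passed into callRequestVote, so nothing holds the lock across the RPC. The struct fields, the RequestVote argument and reply types, and the sendRequestVote helper are stand-ins loosely modeled on the lab skeleton, not the exact code on the slides.

```go
package raft

import "sync"

// Minimal stand-ins so this sketch compiles on its own; the real lab
// code has richer versions of all of these.
type RequestVoteArgs struct {
	Term        int
	CandidateId int
}
type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}
type Raft struct {
	mu          sync.Mutex
	me          int
	peers       []int // stand-in for the lab's list of peer endpoints
	state       string
	currentTerm int
	votedFor    int
}

// Stand-in for the lab's RPC helper, which calls the peer over the network.
func (rf *Raft) sendRequestVote(server int, args *RequestVoteArgs, reply *RequestVoteReply) bool {
	return false
}

func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	rf.state = "candidate"
	rf.currentTerm++
	rf.votedFor = rf.me
	term := rf.currentTerm // snapshot the term while holding the lock
	rf.mu.Unlock()

	for server := range rf.peers {
		if server == rf.me {
			continue
		}
		go func(server int) {
			rf.callRequestVote(server, term) // vote tallying comes later
		}(server)
	}
}

func (rf *Raft) callRequestVote(server int, term int) bool {
	// No rf.mu.Lock() around the RPC: holding the lock across this call
	// is what deadlocked servers 0 and 1 against each other.
	args := RequestVoteArgs{Term: term, CandidateId: rf.me}
	var reply RequestVoteReply
	if ok := rf.sendRequestVote(server, &args, &reply); !ok {
		return false
	}
	// If we need to read or modify rf's state based on the reply, we can
	// lock here, after the RPC has returned.
	return reply.VoteGranted
}
```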
Alternatively, we could lock while we're preparing the arguments, then unlock before actually performing the call, and then, if we need to process the reply, we could lock again afterwards. So just make sure to unlock before making the actual RPC call, and then, if you need to, you can acquire the lock again. Now if I save this and run it again, it still triggers the deadlock detector, but that's actually just because we're not doing anything at the end; now it's actually working: we finish sending the RequestVotes on both sides, and all the operations we wanted to complete are complete. Any questions about this example? Yeah, so you might need to use locks when you're preparing the arguments or processing the response, but you shouldn't hold the lock through the RPC call while you're waiting for the other peer to respond. There's actually another reason for that, in addition to deadlock: in some tests we're going to have this unreliable network that can delay some of your RPC messages, potentially by something like 50 milliseconds, and in that case, if you hold the lock through an RPC call, any other operation you try to do during those 50 milliseconds won't be able to complete until that RPC response is received. So that's another issue you might run into if you hold the lock: it's both to make things more efficient and to avoid these potential deadlock situations.

All right, just one more example, again using a similar Raft implementation. Again, in our Raft state we keep track of whether we're a follower, candidate, or leader, plus those same two state variables. In this example I want you to focus on the attemptElection function. We've first implemented the change I just showed you, storing the term here and passing it as a variable to the function that sends the RequestVotes, but additionally we've implemented some functionality to add up the votes. We create a local variable to count the votes, and whenever we get a vote, if the vote was not granted, we return immediately from the goroutine that's processing the vote; otherwise we acquire the lock before updating this shared local variable that counts the votes, and then, if we don't yet have a majority of the votes, we return immediately, and otherwise we make ourselves the leader. As with the other example, if I look at this it initially seems reasonable, but let's see if anything can go wrong. Here's the log output from one run, and one thing you might notice is that we've actually elected two leaders on the same term: server 0 made itself a leader on term 2, and server 1 did as well. It's okay to have leaders elected on different terms, but having two on the same term should never happen. So how did this actually come up? Let's start from the top. At the beginning, server 0 actually attempted an election
53:37 all right so just one more example this 53:41 is again using a similar Raft 53:45 implementation so again in our Raft 53:47 state we're keeping track of 53:48 whether we're a follower, candidate, or leader and 53:49 then also these two state variables in 53:52 this example I want you to focus on this 53:54 attempt election function so now we've 53:57 first implemented the change that I just 53:59 showed you to store the term here and 54:02 pass it as a variable to our function 54:04 that collects the request votes but 54:06 additionally we've implemented some 54:07 functionality to add up the votes so 54:10 what we'll do is we'll create a local 54:12 variable to count the votes and whenever 54:16 we get a vote if the vote was not 54:18 granted 54:19 we'll return immediately from the goroutine 54:20 where we're processing the vote 54:22 otherwise we'll acquire the lock before 54:25 updating this shared local variable to 54:28 count up the votes and then if we did 54:31 not get a majority of the votes we'll 54:32 return immediately otherwise we'll make 54:34 ourselves the leader so as with the 54:38 other example if you 54:42 look at this initially 54:43 it seems reasonable but let's see if 54:45 anything can go wrong all right so this 54:50 is the log output from one run and one 54:53 thing you might notice is that we've 54:54 actually elected two leaders on the same 54:57 term so server zero 54:59 made itself a leader on 55:03 term two and server one did as well it's 55:06 okay to have leaders elected on 55:08 different terms but here we have 55:09 two on the same term and that should 55:11 never happen alright so how did this 55:13 actually come up let's start from the 55:15 top so at the beginning server zero 55:18 actually attempted an election at term 55:20 one not term two and it got its votes 55:23 from both of the other peers but for 55:27 whatever reason perhaps because the 55:28 reply messages from those peers were 55:30 delayed it didn't actually 55:34 process those votes until later and in 55:38 between 55:40 attempting the election and finishing 55:42 the election server one also decided to 55:45 attempt an election perhaps because 55:47 server zero was delayed so 55:49 much server one might 55:50 actually have run into its election timeout 55:52 and then started its own election and it 55:54 started it on term two because it couldn't 55:57 have been term one since it had already 55:59 voted for server zero on term one over 56:01 here 56:03 okay so then server one sends out its own 56:08 request votes to servers two and zero at term 56:11 two and now we see that server two votes 56:14 for server one that's fine but server zero 56:16 also votes for server one this is actually 56:18 also fine because server one is asking 56:21 server zero for a vote on a higher term and 56:25 so what server zero should do if you 56:28 remember from the spec is set its 56:33 current term to the term in the request 56:35 vote RPC message which is term two and also 56:37 revert itself to a follower instead of a 56:39 candidate alright finally the real 56:43 problem is on this line where 56:44 server zero although it really got enough 56:47 votes on term one made itself a leader 56:49 on term two so one 56:53 explanation for why this is happening is 56:55 that in between where we set up the 56:57 election in attempt election 57:00 and where we actually process the votes 57:02 some other things are happening and in 57:05 this case we're actually voting for 57:07 someone else in between so we're no 57:11 longer on term one where we thought we 57:12 started the election we're now on term two 57:14 and so we need to double-check 57:17 because we don't have the lock while 57:19 we're performing the RPC calls which is 57:21 important for its own reasons 57:23 some things might have changed and we need to 57:24 double-check that what we assumed was true 57:26 when we set ourselves to 57:28 leader is still true 57:32 there are a few different ways 57:34 to solve this you could 57:35 imagine not voting for others while 57:36 we're in the middle of attempting an 57:38 election but in this case the simplest 57:39 way to solve it at least in this 57:42 implementation is to just double-check 57:45 that we're still on the same term and 57:46 we're still a candidate and haven't 57:48 reverted to a follower so one 57:50 thing I want to show you is if we do 57:52 print out our state over here then we do 57:57 see that server zero became a follower but 58:00 it's still setting itself to leader on 58:02 this line 58:04 so yeah we can just check for that if 58:07 we're not a candidate or the current 58:10 term doesn't match the term at which we 58:12 started the election then let's just 58:14 return and if we do that then 58:18 only server one becomes a leader and we 58:20 never see server zero become leader so 58:22 the problem is solved
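Here is a sketch of that vote-counting logic with the double-check added, under the same assumptions about skeleton names as before: after re-acquiring the lock we confirm that we are still a candidate and still on the term the election started with before making ourselves leader.

```go
func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	rf.state = Candidate
	rf.currentTerm++
	term := rf.currentTerm
	votes := 1 // we vote for ourselves
	rf.mu.Unlock()

	for i := range rf.peers {
		if i == rf.me {
			continue
		}
		go func(server int) {
			if !rf.callRequestVote(server, term) {
				return
			}
			rf.mu.Lock()
			defer rf.mu.Unlock()
			votes++
			if votes <= len(rf.peers)/2 {
				return // no majority yet
			}
			// The lock was released during the RPC, so anything may have
			// happened in the meantime: only become leader if we are still
			// a candidate and still on the term this election started with.
			if rf.state != Candidate || rf.currentTerm != term {
				return
			}
			rf.state = Leader
		}(i)
	}
}
```

Note that once we become leader, any later replies are harmless: rf.state is no longer Candidate, so the check simply returns.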
any questions yeah 58:28 yeah I think that would 58:30 because we would not 58:35 if the term is higher now then 58:38 actually no it might not be sufficient because we might 58:40 have attempted another election it 58:42 depends on your implementation but it's 58:44 possible that you could have attempted 58:47 another election on a higher term 58:49 afterwards for all we know 58:51 so it would not be 58:52 sufficient to only check the state but I 58:54 think you're right if you only check the 58:56 term then it is sufficient all right any 59:01 other questions all right so yeah that's 59:09 it for this part she's going to show you 59:11 some more examples of actually debugging 59:14 some of these Raft implementations 59:34 hi can you all hear me yeah 59:52 is it on 60:06 okay so in my section I'm going to walk you 60:14 through how I would debug if you have 60:17 a bug in your Raft implementation 60:20 so I prepared a couple of buggy implementations 60:24 and I'll try to walk you through them so 60:30 first I'm going to go into my first 60:33 buggy implementation and if I run the 60:41 test here for this one it doesn't 60:52 print anything it just gets stuck and 60:55 it's going to sit here forever and let's 61:00 assume that I have no idea why this is 61:02 happening 61:03 the first thing that I want to find out 61:07 is where it gets stuck and we do 61:13 have a good tool for that which is printing 61:16 in the starter code if you go to 61:21 util.go we have a function called 61:25 DPrintf this is just a nice wrapper 61:28 around the log package with a 61:32 debug flag to enable or disable the 61:36 logging messages so I'm going to enable 61:40 that and go back to my Raft code okay so 61:46 first of all when 61:50 there's something buggy happening I 61:54 always go check whether the code actually 62:02 initializes the Raft servers so here 62:08 I'll just add a print 62:20 okay so if I run the test again 62:27 now I know that there are three 62:31 servers that got initialized so this 62:36 part is okay but I still have no idea 62:43 where the bug is happening so I'll just 62:45 go deeper into the code to find 62:48 where it gets stuck so now if you look at 62:50 the code we are calling this 62:55 attempt election function so I'm going to go to that 62:58 function and just to make it faster I'll 63:06 check whether it kicks off an 63:08 election 63:21 that part is still fine so we go 63:25 further now here we are in the election and I'll 63:31 see whether we actually send the 63:37 request votes to the other servers 64:00 now we kind of have more of an idea of 64:02 where it gets stuck because it's 64:04 printing that it kicks off 64:08 the election but not that it's sending the request 64:11 votes so I would go back further just to 64:17 see where it gets stuck I always try 64:21 to print here if we call some function 64:27 I always 64:29 double-check whether it actually 64:31 goes into the function so now I'm going to 64:37 print that this server is at the start of 64:42 the election 64:50 and that works so now we have an idea that 64:56 the bug should be between here and 65:02 here so we are trying to minimize the 65:06 scope of the code that's causing the bug 65:14 let's say I print something here 65:28 and it doesn't get there so I 65:33 move it up let's say here still not 65:41 there 65:48 now it's there so the bug is probably in 65:55 this function and I just go check so 66:02 here the problem is that I'm trying to 66:05 acquire a lock that I already 66:08 hold so it's going to be a deadlock
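For reference, the shape of that bug looks roughly like this (the function names are made up for illustration): Go's sync.Mutex is not reentrant, so a method that takes the lock and then calls a helper that takes it again will block on itself forever.

```go
func (rf *Raft) attemptElection() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.convertToCandidate() // bug: we already hold rf.mu here
	// ...
}

func (rf *Raft) convertToCandidate() {
	rf.mu.Lock() // sync.Mutex is not reentrant, so this blocks forever
	defer rf.mu.Unlock()
	rf.state = Candidate
	rf.currentTerm++
}
```

A common convention is to decide, for each helper, whether the caller is expected to already hold the lock, and to say so in a comment next to the function.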
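For completeness, the DPrintf helper used for these prints lives in util.go in the lab starter code and looks roughly like the following (the exact contents may differ a bit from year to year); flipping the Debug constant turns all of the debug output on or off at once.

```go
package raft

import "log"

// Debugging: set Debug to 0 to silence all DPrintf output.
const Debug = 1

func DPrintf(format string, a ...interface{}) {
	if Debug > 0 {
		log.Printf(format, a...)
	}
}
```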
so 66:13 that's how I would find the first bug 66:16 using DPrintf and it's nice 66:22 to use DPrintf because you can 66:29 just turn off the debugging prints and 66:33 have a nice test output without all the 66:38 debugging if you want it so that's how I 66:43 would use DPrintf to 66:47 hunt down a bug in your code and for this 66:51 example there's actually another trick 66:54 to help you find this kind of deadlock 66:59 so if you press ctrl + backslash you can 67:04 see in the bottom left that I 67:09 pressed control and backslash this 67:13 command sends a quit signal 67:16 to the 67:17 Go program and by default it will 67:21 handle the quit signal by quitting all 67:26 the goroutines and printing out 67:29 the stack traces so now 67:41 if I kick this off here it gets stuck 67:43 and then there are going to be a couple of 67:47 functions printed here 67:55 just going through all the traces 68:07 yes so it's actually showing that the 68:11 function that's causing the problem is 68:14 the convert to candidate function so that's another 68:17 way to find out where the 68:20 deadlocks are I can remove all this 68:43 and now it works so that's the first 68:47 example that I want to go through the second 68:51 thing that you want to do 68:54 before you submit your labs is to turn 68:57 the race 68:58 flag on when you run the tests the way to 69:03 do that is just to add -race before 69:07 -run and here because my 69:18 implementation doesn't have any races 69:20 it's not going to tell you anything 69:22 but just be careful about this 69:25 because it's not a proof that you don't 69:29 have any races it's just that it didn't 69:33 detect any races for you I'm going to run 69:42 the same command again with the race flag 69:45 but now this time there's actually a race 69:48 going on in my implementation so it's 69:56 going to yell at you that there are some 70:00 data races going on in your code 70:08 I'm quitting that and let's see how 70:13 useful the warnings are so I'm going to 70:20 go to my second implementation 70:27 and here 70:37 let's look at this race so it's telling 70:45 us that there's a read going on at 70:48 line 70:49 103 I'm going to that line so the 70:54 read is probably this state here and 71:08 there's also a write at another line which 71:20 is this state so 71:38 I'm going to this line again 71:45 and now we kind of know that this 71:48 mutation is supposed to be protected by a lock so the 71:53 race flag is actually warning us and 71:56 helping us to find 72:00 the data race that we have so the fix is 72:05 going to be to just lock this and unlock 72:15 it and that should solve the problem 72:28 so at this point we kind of know how to 72:31 do some basic debugging does 72:35 anyone have any questions no okay
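Before moving on, here is a small self-contained illustration (not the lecture's code) of the kind of bug go test -race catches, and of the fix of guarding the shared field with a mutex everywhere it is touched:

```go
package main

import (
	"fmt"
	"sync"
)

type counter struct {
	mu sync.Mutex
	n  int
}

func main() {
	c := counter{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the Lock/Unlock pair, `go run -race` reports a data
			// race here: several goroutines write c.n concurrently.
			c.mu.Lock()
			c.n++
			c.mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c.n) // all writers have finished, so this read is safe
}
```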
yeah so 72:42 I'm going to go to the third one which 72:46 is going to be a more difficult bug to find 72:50 I'm going to run the 73:01 tests and now I actually have 73:04 some debugging messages in there already 73:10 and you can see that I also have a 73:17 debugging message with the test action 73:20 this is something you might want to 73:22 consider doing if you go into the test 73:34 file here 73:42 you can see how the test runs 73:46 and there are some actions that the 73:49 test file is going to take to make your code 73:52 fail and it's usually a good idea to 73:58 print out where that action is happening 74:02 among your actual debugging messages so you 74:07 can guess 74:13 where the bug is happening and in which phase of 74:18 the test if that makes sense so now 74:22 I was doing fine in the first case 74:27 I passed the first test but I'm 74:30 failing the second test and here the 74:37 test action is to elect a new leader 74:40 so I'm passing the test until 74:46 this point 74:51 I'm actually passing until leader two rejoins so 74:57 this can give you a nice idea of how the 74:59 test is working and help you 75:09 make a better guess as to where the bug 75:13 is in your code so now let's look at the 75:21 debugging messages 75:32 so it seems like when leader two 75:35 rejoined it became a follower and we 75:40 have a new leader 75:41 so that looks fine to me and we probably 75:46 need more debugging messages instead of 75:50 just the state changes so I am going 76:00 to add some more my first guess is that when 76:05 one becomes a leader it might not be 76:08 doing what a leader should do correctly 76:13 since that's where we got stuck 76:23 so in my code after we convert to leader 76:26 eventually I have a goroutine 76:30 called operate leader 76:32 that's just sending heartbeats 76:34 to all the other servers so I'm going to print 76:41 some stuff here saying that we're sending a heartbeat 76:54 to each server 77:20 so when two becomes the leader it sends the 77:25 first heartbeat to each server and one still 77:33 tries to send heartbeats to the new leader 77:41 and then one becomes a follower so this 77:46 doesn't look like the problem now 77:54 I'm going to check if the other servers 77:56 receive heartbeats correctly 79:25 it's taking a while let me try to 79:29 finish this yeah so two becomes a leader 79:37 two sends heartbeats but no one receives a 79:43 heartbeat from two so if I go to the send 79:54 append entries I actually hold the lock through 79:59 the RPC call which is the problem that 80:03 was covered in the last section so 80:07 that's the problem that I need to 80:10 fix so what I should do is unlock here 80:23 and then 80:33 lock again here and that should work 80:47 we pass and then there are a couple of things 80:53 that you might want to do when you test 80:58 your Raft implementation so there's 81:03 actually a script to run the tests in 81:09 parallel and I can show you how we 81:14 can use it this script is 81:18 on the course forum someone 81:21 made a post about it and here's how we 81:27 can use the script so you run the script 81:33 specify the number of times to run the test 81:36 personally I do like 1000 but that 81:40 depends on your preference this is 81:44 how many tests you want to run 81:47 at the same time and then here's 81:49 the test name and if you run the script 81:59 it will show you that we have run 82:04 four tests so far and all are passing 82:09 and it's going to keep going like that so 82:13 that's how I would go about debugging a 82:17 Raft implementation and you are all 82:19 welcome to come to office hours when you 82:22 need help
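As a final recap of the most common bug in these examples (holding the lock across an RPC), here is a sketch of a heartbeat loop in the style described above. The names operateLeader, sendAppendEntries, and the AppendEntriesArgs fields are placeholders for whatever your implementation uses, the 100 ms interval is only an example, and the snippet assumes the raft package imports time: shared state is snapshotted under the lock, the lock is released, and only then are the blocking RPC calls made.

```go
// Runs while we are the leader; sends heartbeats to every other peer
// without ever holding the lock across an RPC call.
func (rf *Raft) operateLeader() {
	for {
		rf.mu.Lock()
		if rf.state != Leader {
			rf.mu.Unlock()
			return
		}
		term := rf.currentTerm // snapshot shared state under the lock
		rf.mu.Unlock()

		for i := range rf.peers {
			if i == rf.me {
				continue
			}
			go func(server int) {
				args := AppendEntriesArgs{Term: term, LeaderId: rf.me}
				var reply AppendEntriesReply
				rf.sendAppendEntries(server, &args, &reply) // no lock held
			}(i)
		}
		time.Sleep(100 * time.Millisecond) // heartbeat interval (example value)
	}
}
```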