0:00 All right, good afternoon everybody, let's get started. You guys want to let us in on what's so 0:07 exciting? Just kidding. Anyway, we're 0:13 almost there. Today we're going to be talking about cache coherence. We're 0:19 basically going to continue the theme that we started last time, which is talking about issues that come up when 0:27 you actually have multiple processors trying to execute something in parallel. Last time 0:34 we focused on the synchronization problems and how we deal with things 0:39 like precedence constraints, limited resources, and critical sections, 0:46 which need to not be interrupted. Today we're going to look at what happens when you have multiple caches 0:52 operating in parallel, potentially on shared data, and how we deal with that. 0:58 Before we get started with that, just a couple of final reminders. 1:05 Everything is due tomorrow. I expect today's lecture to run a little 1:13 short, so hopefully that'll give you a few extra minutes to go and work on your design project if you want. As far as 1:22 what's left for the course: tomorrow's recitation is going to cover 1:27 cache coherence problems, which is material that's going to be covered on 1:32 our third quiz next week. However, the last lecture and the last recitation are 1:40 optional. I am told that the last lecture is going to be really fun, so you should 1:45 come. As for the last recitation: we didn't want to start 1:51 going through the practice quizzes just because it's a little too soon before the quiz, so we're going to hold an 1:58 evening quiz review session a few days before the quiz, and we'll use the 2:03 Wednesday recitation just for open questions on any topic that you might think you need help with. 2:11 What else? Oh yeah, and don't forget to fill out the feedback on 6.004. 2:19 Okay, any questions before we get started? All right, terrific. 2:29 So, as we talked about last time, you can either have multiple processors or 2:35 multiple cores that are part of the same CPU. In this picture, what we see is 2:41 multiple cores, each of which has its own private cache, and then we have a memory that's shared across them all. 2:49 As far as your programs are concerned: we already saw last time that 2:56 dealing with synchronization is difficult enough when you have multiple things running at the same time. You can only imagine that if, on top 3:04 of that, each process had to keep track of what was in each one of the private caches in order 3:11 to make sure that it did the right thing, that would be nearly impossible. So what we want is for these private caches to 3:18 effectively be invisible to the programmer. As far as the 3:25 program is concerned, we want it to think that there is one large shared memory and that everything is 3:33 consistent across it. So today we're going to see what the issues are and how we deal with that consistency. 3:40 First, let's take a look at what the actual problem is if we don't 3:45 do anything. Imagine that core 0 here wants to do 3:50 a load from address A. It issues a load operation to address A, 3:57 main memory responds and says the value at address A is 2, and so core 0 now caches that value 4:05 with some tag information associated with address A.
4:11 If core 0 gets a future request for address A, it can respond from the cache. Now core 2 comes along, 4:16 and it too wants to do something with address A: it's going to do a store to address A, and 4:22 specifically it's going to store the value 3 into address A. So it too is going 4:28 to bring address A into its cache, but now it's going to update it to 3. And now when we go back to core 0, which 4:36 thinks it already has address A in its cache, and it goes to do another load, it's going 4:43 to read the value 2, which is the stale value, when it should read the updated value 4:48 that was written by core 2, which is 3. Okay, so we have 4:55 to figure out how to avoid this stale data. The way that 5:01 we do that is with what's called a cache coherence protocol. Cache 5:06 coherence protocols are basically going to ensure that we don't have stale lines, and the most common way of doing this is 5:15 through what we call invalidate mechanisms (we'll get into more detail on this in a second). What 5:23 happens is that before you allow a write to any particular address, you first have to invalidate the caches that also hold 5:32 that address. So here you would first invalidate cache 0's copy of address A, 5:38 then you would be able to write to core 2's copy of address A, and you wouldn't have a 5:44 problem when core 0 goes back to read it, because now it would actually 5:49 go and fetch it from main memory, since it's no longer in its cache. All right, so there are 5:55 a few things that we have to make sure of in order to handle cache coherence. 6:01 First of all, we have to enforce two rules. The first is write 6:07 propagation, and what write propagation says is that when I do a 6:12 write, eventually that write has to become visible to all of the processors. And 6:18 the second thing that we have to ensure is write serialization, and what this 6:23 means is that if I'm doing multiple writes to the same address, then all of the processors have to see those writes 6:30 in exactly the same order. Okay, so how do we go about ensuring these two 6:37 things? First of all, for write propagation we have two mechanisms, the 6:42 more common one being the first, which is what we just mentioned: write-invalidate protocols. What this 6:48 means is that when somebody wants to do a write, the first thing you do is invalidate the copy of that same 6:56 address in all of the other caches, and then the processor has its own copy of 7:01 that address and is able to do the write. An alternative to this is 7:07 what's called a write-update protocol. In this case, instead of invalidating the other caches, the cache that 7:14 performed the most recent write has to make sure that it 7:19 propagates its information, so that the other caches can update themselves to the most recent value as well. 7:26 But we're going to be focusing on the first one, write-invalidate, which is a lot simpler to implement. To ensure 7:34 write serialization, we use one of two protocols. The more basic one is a 7:41 snoopy-based protocol, and we're going to spend most of today's lecture on that, but a more efficient and more scalable 7:49 one is a directory-based protocol, which we'll just touch upon at the end of 7:55 today's lecture.
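To make the write-invalidate idea concrete, here is a minimal C sketch. All the names here (Cache, bus_invalidate, store) are invented for illustration; real hardware does this in the cache controllers, not in software. The key point is the ordering: invalidate everyone else first, then write.

```c
#include <stdbool.h>

#define NUM_CORES 4
#define MEM_WORDS 1024

/* One-line-per-core toy model of private caches over a shared memory. */
typedef struct {
    unsigned addr;   /* which address this cache currently holds */
    int      value;
    bool     valid;
} Cache;

static Cache caches[NUM_CORES];
static int   memory[MEM_WORDS];

/* The broadcast step of a write-invalidate protocol: every other cache
 * holding this address must drop its (about-to-be-stale) copy. */
static void bus_invalidate(int writer, unsigned addr) {
    for (int c = 0; c < NUM_CORES; c++)
        if (c != writer && caches[c].valid && caches[c].addr == addr)
            caches[c].valid = false;
}

/* A store first invalidates all other copies, then performs the write.
 * (Write-through here, to match the first protocol in the lecture.) */
static void store(int core, unsigned addr, int value) {
    bus_invalidate(core, addr);
    caches[core] = (Cache){ .addr = addr, .value = value, .valid = true };
    memory[addr] = value;
}
```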
All right, so let's start with snoopy caches. How do snoopy caches work? 8:01 The idea is that I have these multiple processors running in parallel, and each one of them has its own private cache, 8:07 which we call a snoopy cache. The caches are all 8:14 connected through a shared bus, and from there they can also connect to 8:19 main memory. Now, in the past, when we dealt with our caches, the processor was 8:24 the one making requests, either loads or stores, and the cache was responding: if it had the value, it would 8:31 have a hit and return the data; if it had a miss, it would go to main memory and fetch the data from there. In 8:38 this case, things get a little more complicated, because caches now actually have to listen 8:43 to two sources: they have to listen to the requests coming from the processor, but they also have to listen 8:50 to what's going on on the bus, and they need to respond accordingly 8:56 based on both of those. So how do we 9:01 go about doing this? The first thing is we have to make sure that all of our 9:08 memory requests and our broadcasts on this bus are in order, so that everybody 9:14 sees all loads and all stores in exactly the same order. Then what happens is 9:19 that each cache controller is going to snoop (which is where they get their name) in order to see what's 9:27 going on on the bus, not only for its own transactions but also for what everybody else is doing, and based on 9:33 whatever is going on on the bus, it's going to potentially change the state of its cache lines. So basically what we're 9:42 going to do is implement a finite state machine which defines what 9:47 exactly each cache line needs to do, depending on its current state and what's going on in either the processor 9:55 requests or the shared bus. All right, so we're going to start off with a pretty 10:01 simple protocol: just valid/invalid (VI). Just like in our original 10:08 implementation of caches, we had the concept of a valid bit, where the valid bit told you whether the 10:14 cache line that you have is valid, or whether you just started up and it might be garbage 10:21 in there. In cache coherence, we want to use the valid bit to indicate a 10:28 little bit more, and the way we're going to use the valid 10:33 bit for cache coherence is as follows. It's going to follow the FSM that's described here, where we have two 10:40 states, valid and invalid. You start off in the invalid state, and suppose 10:45 that the processor issues a read request, or a load. If it issues a read 10:52 request, then I'm going to fetch something from main memory. The way I do that is by 10:58 initiating a bus read request, and the bus read is going to go talk to main 11:03 memory, get the data it's requesting, and bring it into the cache. When I bring it into 11:10 the cache, my state is now valid, so I've moved to this valid state over here. Once I'm in the valid state, that 11:17 means I've brought the value in from memory, so if I now get another read request, I can respond directly from my 11:25 cache. That's what this feedback arrow here represents: if 11:31 we get another read request, we don't have to initiate any additional bus 11:36 transactions; we can just respond from our cache. However, if we get a write 11:41 request, then we have to initiate what's called a bus write transaction. The 11:48 idea is that in this implementation, the caches we're using are write-through caches. Can anybody tell me if 11:56 they remember what write-through caches do? Yep, 12:07 correct: you're always writing to memory as well. So basically what's 12:12 happening here is that even when I'm in the valid state, if I'm 12:18 doing a write, I am going to initiate a bus write, so that my write is 12:24 also written to main memory. Okay, now what does this dashed line here mean? 12:30 It means: suppose that I'm another cache, and I see on the bus a bus 12:36 write request initiated by somebody else; what do I need to do? 12:42 If I was in the valid state and someone else is now trying to write a particular 12:49 location that I also have in my cache, I need to pay attention to this bus 12:54 write action and become invalid, so that the only cache that has a valid copy of 13:04 that particular address is the one that's actually doing the writing. Any questions about that? Great. Then 13:15 finally, in this particular protocol, because it's write-through, if we're in 13:21 the invalid state and we get a processor write request, then we 13:27 simply write directly to memory and stay in the invalid state: we initiate this bus write action, which 13:35 writes directly to memory. Okay, so far not too complicated.
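The VI state machine just described can be written down as a small transition function. This is a sketch, not hardware: the event names (PR_READ, PR_WRITE, BUS_WRITE_SEEN) are invented for this example, and the output flags stand in for the bus transactions the controller would issue.

```c
typedef enum { INVALID, VALID } VIState;

/* Events a cache line can see: from its own processor, or snooped
 * off the shared bus (event names invented for this sketch). */
typedef enum { PR_READ, PR_WRITE, BUS_WRITE_SEEN } VIEvent;

/* One step of the VI (valid/invalid, write-through) protocol.
 * The two flags report which bus transaction, if any, must be issued. */
VIState vi_next(VIState s, VIEvent e, int *bus_read, int *bus_write) {
    *bus_read = *bus_write = 0;
    switch (s) {
    case INVALID:
        if (e == PR_READ)  { *bus_read  = 1; return VALID;   } /* fetch from memory   */
        if (e == PR_WRITE) { *bus_write = 1; return INVALID; } /* write through, stay */
        return INVALID;
    case VALID:
        if (e == PR_READ)  return VALID;                       /* hit, no bus traffic */
        if (e == PR_WRITE) { *bus_write = 1; return VALID; }   /* still write through */
        if (e == BUS_WRITE_SEEN) return INVALID;               /* someone else wrote  */
        return VALID;
    }
    return s;
}
```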
13:41 So let's work through a simple example just to make sure that all these 13:46 transactions make sense. Imagine a situation where we just have two cores, core 0 and core 1; 13:53 each one of them has its own private cache, and we have this shared main memory. Imagine now that core 0 starts 14:01 off by doing a load of address A. So what is that going to do? Based on our FSM, 14:08 that's going to initiate a bus read request for address A, and the memory sees the 14:14 bus read request and is going to respond with what is in address A. 14:20 That response comes back to the cache of core 0, which changes its state to the valid state 14:27 and records that, for the entry associated with address A, the 14:32 value is 2. So now we proceed, and now core 1 also wants to load address A. 14:40 In this case, once again, it's going to initiate a bus read request, and it too 14:47 is going to get a response, so it changes its state to valid and 14:53 also updates its cache. Now, if I get any further read requests from either 14:58 core 0 or core 1, I no longer need to initiate any more bus transactions, 15:03 because I now have a valid copy in my cache; any further responses can come directly from 15:10 the cache. But now suppose that the next 15:16 thing that happens is that core 0 wants to execute a store to address A. We 15:23 said that when we execute a store, we need the other caches to 15:29 invalidate their copies. The way this is going to happen is that the 15:34 store initiates a bus write action on our bus, and that's going to 15:41 take this value 3, which we're trying to write into address A, and update main memory with it. But in addition, it's 15:49 going to go to core 1 and have core 1 invalidate its entry for address A in 15:55 its cache. Once it invalidates its entry (and if you had 16:00 more cores, once all of them invalidated their entries), core 0 is the only one that actually has a copy 16:08 of address A that's valid in its cache, and so now it's able to go ahead and 16:14 update it to 3, and everything will continue to be consistent across the 16:19 entire system. So now, if core 1 does another load of A, 16:27 because its value was made invalid, it's going to issue another bus read, and it's going to get a 16:34 response that says: okay, the new value of address A is now 3. 16:40 [In response to a question:] Yes, because it's a write-through cache, 16:48 you're always writing to memory anyway, so there's no particular need 16:54 to bring it into your cache. If you were going to then use it for anything else, you could also do another read, 17:01 which would then bring the updated value into the cache. 17:10 In a write-through cache, you could do it both ways.
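Assuming the vi_next sketch from above, the walkthrough we just did can be replayed as a short driver (purely illustrative; the value 3 lives in the modeled memory, not in these states):

```c
#include <stdio.h>

/* Assumes the VIState/VIEvent/vi_next definitions from the sketch above. */
int main(void) {
    int rd, wr;
    VIState c0 = INVALID, c1 = INVALID;

    c0 = vi_next(c0, PR_READ,  &rd, &wr);   /* core 0: LD A -> VALID, BusRd */
    c1 = vi_next(c1, PR_READ,  &rd, &wr);   /* core 1: LD A -> VALID, BusRd */
    c0 = vi_next(c0, PR_WRITE, &rd, &wr);   /* core 0: ST A=3 -> BusWr      */
    if (wr)                                 /* core 1 snoops the bus write  */
        c1 = vi_next(c1, BUS_WRITE_SEEN, &rd, &wr);  /* -> INVALID          */
    c1 = vi_next(c1, PR_READ,  &rd, &wr);   /* core 1: LD A refetches the 3 */

    printf("core0=%d core1=%d (1 = VALID)\n", c0, c1);
    return 0;
}
```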
17:19 Okay, so who sees some problems with this simple VI protocol? Is it great? Should we stop here 17:29 and go home, or can we do a little better? [A student answers.] Yeah, well, that's true, but we're not 17:46 actually updating; we're invalidating, which is a little bit faster than actually updating. But the key thing to notice 17:53 here is what has to happen every time we do a write: 18:00 I need to update my memory, and I need to 18:05 initiate a bus transaction. So regardless of whether I 18:14 have the data in my cache or not, I'm always writing back to main memory, which is costly, because I 18:20 don't necessarily have to write every single one of those values back. And in addition, every single write requires a 18:28 broadcast on this bus, which is an expensive operation. So we want to try to improve on that. 18:37 Before we get to our next protocol, let's talk a little bit about what we 18:43 need to be true in order for our cache coherence to work properly. We 18:48 said that all of the loads and stores that are seen in our entire system have 18:54 to have a global ordering, meaning that everybody sees them in exactly the same order. Now, if we're not careful and 19:01 multiple caches have copies of the same address, you can imagine getting yourself into trouble where 19:08 that's not actually what's happening. So in order to ensure that we do have this 19:13 global ordering, we can do the following. The first is that we only allow one 19:18 cache at a time to have write permission to an address. Just like we saw in the 19:24 simple valid/invalid protocol, when one of the caches 19:29 wanted to write to a location, the other one, if it had a valid copy, had to become invalid. And the 19:37 second thing is that at no point can any cache hold a stale copy of the data: 19:43 it has already gone to the invalid state, so it will no longer consider the old value to be 19:52 the correct answer. All right, so we're now going to introduce a new protocol, 19:57 which is a slightly modified version of valid/invalid, and this is called MSI, 20:03 which stands for modified, shared, invalid. The invalid state basically means 20:11 the same thing, which is that you don't have a copy of that address in your cache. The 20:17 shared state indicates that your cache has a particular address 20:25 in it, but you are only allowed to execute reads to it, because some other caches may have that address as well. And 20:33 finally, the modified state means that only this cache has the address, and so 20:39 it's safe to do both reads and writes if you're in the modified state.
20:45 All right, so let's take a look at what the FSM for the MSI protocol looks like. 20:52 First of all, let's remind ourselves what we said the drawbacks of the VI mechanism were. The drawbacks were that 20:59 every update also wrote to main memory, and every update caused a 21:04 broadcast on our bus. MSI is going to solve both of those issues. 21:10 First of all, it's going to allow us to have write-back caches; it doesn't have to be a write-through 21:15 cache. And some of our writes are going to be 21:21 satisfied locally, without having to initiate any bus transactions. So let's 21:26 take a look at this. Now, the state diagram might look a little scarier, so let's go through what it's telling 21:33 us. First of all, we have our three states, M, S, and I. Being toward the bottom means 21:41 we essentially have fewer permissions, and as we move up in the states we have 21:46 more permissions, so modified is the state that gives us the most permissions, allowing us to actually 21:52 write to a particular location. On the left, what we show in orange 22:00 arrows are the processor-initiated requests. A processor-initiated 22:06 request can cause some bus transactions, which is what's shown after 22:12 the slash, but everything on this side of our FSM is 22:18 initiated by a processor request. Now, you'll notice that when we have 22:24 processor requests, what's basically happening is that the processor is getting us more permission for a 22:31 particular location, and so we're moving up in this FSM from a lower state to 22:38 a higher state. On the right-hand side, these dashed blue lines are bus-initiated 22:44 transactions. This means that when some other cache has 22:51 responded to some processor request and as a result has broadcast 22:59 something onto the bus, all the other caches that see that information on the bus need to respond 23:05 accordingly. Okay, so let's take a little bit of a closer look at what's actually 23:11 happening. We'll start off in the invalid state, and let's say that I 23:16 want to read something. My processor issues a processor read, and 23:22 that causes a bus read request, in order to tell everybody else that I'm 23:30 reading this particular location, and now I've moved into the shared state. I am 23:37 able to read the location, but not modify it, while I'm in the shared state. Now, if instead of just doing a read I 23:46 want to actually do a write, then I'm going to go all the way up to the modified state. But in order to do that I have to 23:53 make a stronger statement on my bus, which is called a bus read 23:58 exclusive (BusRdX). This is basically saying: I want to get an exclusive copy of this location into my cache. So what 24:07 happens if one cache issues a bus read exclusive? All the other caches, 24:12 which are listening, whether they're in the shared state or the modified state, if they see a bus read exclusive, are 24:19 going to put themselves back into the invalid state. We see that 24:24 along this arrow from modified to invalid, as well as here from shared to invalid. 24:32 All right, so now imagine that I made it to the modified state. 24:39 That means I'm the only cache that has a copy of this address, and so I'm free to do what I want with it: I 24:46 can read it, I can write it, and you'll see that the reads and the writes in this case don't require any 24:53 bus transactions. This is where MSI is better than the VI protocol, 24:59 because I'm now saving on bus transactions, which are costly. And 25:07 finally, there are a few more transitions: if you're in the shared state and you want to write, then 25:13 you do a similar thing to what you did from invalid, which is issue a 25:19 processor write, which causes a bus read exclusive; that takes you to the 25:24 modified state, but it makes any other cache that has that address invalidate itself.
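Here is the MSI diagram as a transition function, in the same sketch style as the VI one. The BusRd/BusRdX/write-back names follow the lecture; the enum and flag encodings are my own invention.

```c
typedef enum { I, S, M } MSIState;
typedef enum { PR_RD, PR_WR,               /* from this line's processor */
               BUS_RD, BUS_RDX } MSIEvent; /* snooped off the bus        */
typedef enum { NONE, BUS_READ, BUS_READ_X, BUS_WRITEBACK } BusAction;

/* One MSI step: returns the next state and reports the bus transaction
 * (if any) this cache must issue in response. */
MSIState msi_next(MSIState s, MSIEvent e, BusAction *act) {
    *act = NONE;
    switch (s) {
    case I:
        if (e == PR_RD) { *act = BUS_READ;   return S; }  /* fetch, read-only   */
        if (e == PR_WR) { *act = BUS_READ_X; return M; }  /* claim exclusively  */
        return I;
    case S:
        if (e == PR_RD)   return S;                       /* local hit          */
        if (e == PR_WR) { *act = BUS_READ_X; return M; }  /* upgrade to writer  */
        if (e == BUS_RDX) return I;                       /* another core wrote */
        return S;
    case M:
        if (e == PR_RD || e == PR_WR) return M;           /* all local, no bus  */
        if (e == BUS_RD)  { *act = BUS_WRITEBACK; return S; } /* share dirty data  */
        if (e == BUS_RDX) { *act = BUS_WRITEBACK; return I; } /* hand off and drop */
        return M;
    }
    return s;
}
```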
25:31 So let's work through an example just to make sure it all makes sense. We're 25:38 going to follow pretty much the same example that we had before, where we have two cores, each of them with its own 25:44 private cache, and we have this shared memory. We start off with core 0 wanting to do a load of address A. 25:53 The first thing that happens is we issue a bus read request for address A, 25:58 and main memory is going to respond, so now we put ourselves into 26:04 the shared state with the value of address A, which is 2. Now core 1 26:10 comes along and it also wants to load address A, so it's going to initiate another bus read request. Once again 26:18 the memory is going to see it and respond, and core 1 too moves into the shared state with a 26:25 value of 2 for address A. All right, so once 26:30 both of these cores have loaded the value, just like in the VI protocol, they're now 26:36 able to respond to loads directly from 26:41 the cache; they don't need to initiate further bus reads. But let's take a 26:47 look at what happens when we do stores. Now core 0 wants to do a store 26:52 to address A. In order to do a store, we're going to have to make an exclusive read request: we 27:00 initiate this bus read exclusive for address A. And what does that do? Well, 27:08 that's basically going to tell core 1 that it needs to invalidate its entry for 27:16 address A, and once it invalidates it, core 0 can move to the modified 27:22 state and go ahead and update that location. Now, you'll notice that at this 27:27 point we've updated the location in the cache; we have not updated main memory, 27:32 because this is a write-back cache, not a write-through cache. So now, if 27:41 core 0 gets either loads or stores, it's able to handle both of them 27:46 locally from its own cache. But now suppose that core 1 comes along, and 27:55 it too wants to do a store to address A. So now what do we have to do? 28:06 Anybody? Yeah: we're going to issue another 28:12 bus read exclusive. Core 0 is going to see this bus read exclusive, and it's 28:17 going to invalidate itself. But before invalidating itself, because it was in 28:23 the modified state, in this case it initiates a bus write-back. This 28:29 write-back takes the value that was modified in the cache of core 28:35 0 and writes it back to memory, so the memory is going to have this updated value of 3 written back to 28:42 it. Once that happens, core 0 becomes invalid, and now core 1 is 28:49 able to go ahead and modify its value, so it moves to the modified state 28:56 and updates its local copy of the cache to the new value for A. 29:11 [In response to a question:] Well, if you were doing a load, you would also have to do a write-back. The 29:17 idea is basically that any time your line is dirty and you're taking it out of your cache, you need to at least send it back; 29:25 that's your way of handing off the latest data 29:30 that you put in there. So if core 1 were just doing a load, what would happen would be the same: 29:35 core 0 does a write-back of the value 3 for 29:42 address A, and now core 1 would read that and get 3 as its value, 29:47 which is what I want, not what was previously in main memory.
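Assuming the msi_next sketch above, the two stores in this example play out like this (illustrative driver only):

```c
/* Assumes the MSIState/MSIEvent/BusAction/msi_next definitions above. */
int main(void) {
    BusAction a;
    MSIState c0 = I, c1 = I;

    c0 = msi_next(c0, PR_RD, &a);    /* core 0: LD A -> S (BusRd)            */
    c1 = msi_next(c1, PR_RD, &a);    /* core 1: LD A -> S (BusRd)            */

    c0 = msi_next(c0, PR_WR, &a);    /* core 0: ST A=3 -> M (BusRdX)         */
    c1 = msi_next(c1, BUS_RDX, &a);  /* core 1 snoops it -> I                */

    c1 = msi_next(c1, PR_WR, &a);    /* core 1: ST A -> M (BusRdX)           */
    c0 = msi_next(c0, BUS_RDX, &a);  /* core 0 snoops it: a == BUS_WRITEBACK */
                                     /* (dirty 3 goes to memory), then -> I  */
    return 0;
}
```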
30:03 [In response to a question:] Well, you want to make sure that you don't have stale copies in your cache, 30:09 right? We said that basically we have to have this global vision of 30:16 all the loads and stores, and so we want to make sure that everything 30:22 actually appears on the bus, even though, yes, in this particular case it's going to then get overwritten. All right, so one 30:33 thing to note here is that in the case of MSI, your cache can 30:41 actually be responding as well as your main memory. So you have a little bit of a race condition, and you need to 30:48 make sure that if both of them have different answers, the one that's actually listened to is the one 30:55 from the cache. Typically that's not so hard to implement, because the caches respond more quickly, but that's 31:02 just something to keep in mind. All right, so finally, one last example for this MSI 31:10 protocol: now we want to do a final load. We've got core 1 31:17 in the modified state, and we want core 0 to now do a load. So what's going to 31:23 happen in this situation? We can't have core 0 do the load 31:29 while core 1 is in the modified state. So we're going to issue a bus read 31:34 request; core 1 is going to see that bus read request, and it's going to say: okay, I 31:40 need to get out of my modified state. The way that it does that is it issues a bus write-back in order to 31:46 write its value back and remove it from its cache. So it writes the 10 back to 31:53 main memory; after it does that, it can go to the shared state, and now core 0 32:00 can also go to the shared state, and both of them will have the same value, which is the most recent value, 32:07 which core 1 had previously written. Any questions about that? 32:16 Right, so we can make a slight 32:23 modification on MSI and make things a little bit better if we 32:28 notice that we have many situations where, even if I'm not sharing data, I'm 32:35 going to be going through this read-modify-write sequence of events. If you look back at your FSM for MSI, 32:43 what happens when I need to do a read-modify-write? How many bus transactions 32:49 am I going to need? I'll need one for 32:56 my read and another one for the write. So basically, even if I'm 33:03 working on my own private data that nobody else has in their cache, I'm actually causing two bus transactions, 33:10 which are pretty costly. We can solve that by adding one more 33:15 state, which is called the exclusive state. The idea here is that 33:20 instead of going into the shared state when you do a read, if nobody 33:27 else has a copy of that address, then you go into the exclusive state. By 33:33 being in the exclusive state, you're basically giving yourself the opportunity to save one bus transaction 33:39 when you do a write, since you're the only one who actually has that address in their cache. 33:44 So here's the modified FSM for 33:52 this new four-state machine, which we call MESI, for modified, exclusive, 33:59 shared, invalid. It may look scary, but it's 34:04 pretty similar to what we had before; we're only adding a couple of additional arrows plus this exclusive state. 34:11 Essentially, the things that are changing are: if I'm in the invalid state and I 34:17 get a processor read of a particular location, then if that read is 34:24 shared, meaning other caches have that address as well, I go to the shared 34:30 state; but if nobody else has that address, then I can go to this exclusive state. Once I'm in the 34:37 exclusive state, reads I can respond to right away, but more 34:44 importantly, if I then do a write, I can go from exclusive to modified by 34:50 just issuing a processor write, without any additional bus requests. 34:55 Any questions about that? All right, 35:02 good.
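Only a couple of transitions actually change relative to MSI, and they can be sketched on their own. Here, others_have_copy stands in for the "shared" signal a real bus provides during the read, and BusAction is reused from the MSI sketch; all of it is illustrative.

```c
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MESIState;

/* On a read miss, the bus tells us whether anyone else holds the line;
 * MESI uses that "shared" signal to pick S or E (sketch only). */
MESIState mesi_read_miss(int others_have_copy) {
    return others_have_copy ? MESI_S : MESI_E;   /* I -> S, or I -> E */
}

/* The payoff: a write from E upgrades to M silently, with no bus
 * transaction. (Snooped transitions such as E -> S on a BusRd are
 * omitted here; they match MSI.) */
MESIState mesi_write(MESIState s, BusAction *act) {
    *act = NONE;
    if (s == MESI_E || s == MESI_M) return MESI_M;  /* no broadcast needed */
    *act = BUS_READ_X;                              /* from I or S, still  */
    return MESI_M;                                  /* announce on the bus */
}
```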
So that's it for snoopy-based caches. As we add more 35:10 and more parallel processors, snoopy-based caches stop scaling: basically, it becomes too 35:18 cost-prohibitive to be sending these broadcast messages to every single one of our processors. So as your 35:24 systems get larger, what we use is directory-based cache coherence. 35:30 We're not going to get into the details of how this works, but the main idea is as follows: you have a 35:38 whole bunch of processors, each of them has its own cache, and you take your entire memory and you divide 35:45 it up amongst all of those processors, so each one of the processors is 35:50 responsible for a portion of the memory. And in addition to being responsible for 35:56 that portion of the memory, they are also responsible for maintaining a directory, which has all the information 36:02 relevant to what's going on in that memory. In that directory, 36:07 any time some other processor makes a request for a particular address, it's going to keep track of it, 36:14 and it's going to know who has which copy, and whether they're in the modified, 36:20 shared, exclusive, or whatever state. Now, when it needs to make the caches 36:26 change something about their state, instead of sending a broadcast on a big shared bus 36:32 to everybody, it sends an individual message to the particular processor that it's relevant for. 36:39 So we're reducing the amount of communication by moving to this model, 36:44 and you can certainly read up more about it to get more detail.
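A minimal sketch of what a directory entry might look like at a line's home node (the field names, states, and 32-core limit are arbitrary choices for illustration, not a real design):

```c
#include <stdint.h>

/* One directory entry per memory line, kept at the line's home node. */
typedef struct {
    enum { UNCACHED, SHARED_BY_MANY, MODIFIED_ELSEWHERE } state;
    uint32_t sharers;    /* bit i set => core i holds a copy (<= 32 cores) */
} DirEntry;

/* On a write request, the home node messages only the caches that
 * actually hold the line, instead of broadcasting on a shared bus. */
void dir_handle_write(DirEntry *e, int writer,
                      void (*send_invalidate)(int core)) {
    for (int c = 0; c < 32; c++)
        if (((e->sharers >> c) & 1) && c != writer)
            send_invalidate(c);          /* point-to-point, not broadcast */
    e->sharers = 1u << writer;           /* writer is now the only holder */
    e->state   = MODIFIED_ELSEWHERE;
}
```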
36:52 Now, before we finish, there's just one last thing that I wanted to mention, which is that as you're 36:58 dealing with these cache coherence issues, you're going to find that 37:04 different situations (and we're going to look into one example here) can lead you 37:10 to bad performance in your cache unnecessarily. So let's take a look 37:16 at what I mean by that through this example. Suppose that you have a cache line. Until now, in 37:23 today's lecture, we were just talking about a cache line that has a single word in it, but when we studied 37:29 caches we learned that in reality your cache lines are generally blocks of 37:34 multiple words. So imagine that my line actually has N+1 37:42 words, and now imagine that one cache is dealing with one word in my 37:49 line while another cache is dealing with another word in my line. Because these two words happen to live in the 37:57 same line, any time one processor wants to write to the location 38:06 that it is trying to modify (let's say it's word I, and the other processor is 38:11 trying to modify word K), even though they're not actually interfering with each other (in other words, they're not 38:17 trying to modify the same address at all), what's going to have 38:22 to happen is that they each need to get the entire line into the modified state. So 38:28 basically, the first cache is going to request to modify that line, which is 38:34 going to force the other one to invalidate itself, and then it's going to update word I. Then the next 38:40 one is going to come along, and it's going to want to update word K, so it too 38:45 is going to request the modified state, which is going to make the first processor invalidate itself, so that it can update 38:52 the line, even though there's actually no conflict between these two writes. So the 38:59 point that I'm trying to make here is that if we're not careful, there can be a lot of ping-ponging amongst our 39:08 different caches in terms of who is requesting the ability to write, 39:14 and sometimes it's not for a real reason. In this example, you could 39:24 simply move where your words are located, such that they don't end up in the same 39:30 cache line, and by doing that you would avoid this ping-ponging problem. And just like it occurs 39:37 here, it occurs in other situations as well, so we want to always be 39:43 careful to think about what we need to do if there is a possibility of this ping-ponging, which is going to 39:50 reduce the performance of our caches. All right, so that's it for today. As I 39:59 mentioned, next time is a fun lecture about putting it all together. In case we 40:05 don't see you next time: it's been a great semester, thank you, and good 40:11 luck finishing up all your work for tomorrow, and we'll see you at the quiz. Thanks.
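This ping-ponging is usually called false sharing, and it's easy to provoke in ordinary C code. In the sketch below, two threads increment two different counters that happen to live in the same struct, so they fight over one cache line; the padded apart layout (assuming a typical 64-byte line, which the C standard does not guarantee) puts each counter on its own line and removes the contention.

```c
#include <pthread.h>
#include <stdio.h>

#define N 100000000L

/* Two counters in one struct: they almost certainly share a cache line,
 * so the two threads below ping-pong the line between their caches. */
struct { long a, b; } together;

/* Same counters padded apart (assuming a typical 64-byte line; the C
 * standard guarantees nothing here). Swap this in to see the speedup. */
struct { long a; char pad[64]; long b; } apart;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) together.a++;   /* writes word "I"... */
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) together.b++;   /* ...and word "K"    */
    return NULL;
}

int main(void) {                     /* build with: cc -O2 -pthread fs.c */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", together.a, together.b);
    return 0;
}
```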