0:00 All right, good afternoon everybody, let's get started. You guys want to let us in on what's so 0:07 exciting? Just kidding. Anyway, we're 0:13 almost there. Today we're going to be talking about cache coherence. We're 0:19 basically going to continue the theme that we started last time, which is talking about issues that come up when 0:27 you actually have multiple processors trying to execute something in parallel. Last time 0:34 we focused on the synchronization problems and how we deal with things 0:39 like precedence constraints, limited resources, and critical sections, 0:46 which need to not be interrupted. Today we're going to look at what happens when you have multiple caches 0:52 operating in parallel, potentially on shared data, and how we deal with that. 0:58 Before we get started with that, just a couple of final reminders. 1:05 Everything is due tomorrow. I expect today's lecture to run a little 1:13 short, so hopefully that'll give you a few extra minutes to go and work on your design project if you want. As far as 1:22 what's left for the course: tomorrow's recitation is going to cover 1:27 cache coherence problems, which is material that's going to be covered on 1:32 our third quiz next week. However, the last lecture and the last recitation are 1:40 optional. I am told that the last lecture is going to be really fun, so you should 1:45 come. As for the last recitation: we didn't want to start 1:51 going through the practice quizzes just because it's a little too soon before the quiz, so we're going to hold an 1:58 evening quiz review session a few days before the quiz, and we'll use the 2:03 Wednesday recitation just for open questions on any topic that you might think you need help with. 2:11 What else? Oh yeah, and don't forget to fill out the feedback on 6.004. 2:19 Okay, any questions before we get started? All right, terrific. 2:29 So, as we talked about last time, you can either have multiple processors or 2:35 multiple cores that are part of the same CPU. In this picture, what we see is 2:41 multiple cores, each of which has its own private cache, and then we have a memory that's shared across them all. 2:49 As far as your programs are concerned: we already saw last time that 2:56 dealing with synchronization is difficult enough when you have multiple things running at the same time. You can only imagine that if, on top 3:04 of that, each process had to keep track of what was in each one of the private caches in order 3:11 to make sure that it did the right thing, that would be nearly impossible. So what we want is for these private caches to 3:18 effectively be invisible to the programmer. As far as the 3:25 program is concerned, we want it to think that there is one large shared memory and that everything is 3:33 consistent across it. So today we're going to see what the issues are and how we deal with that consistency. 3:40 First, let's take a look at what the actual problem is if we don't 3:45 do anything. Imagine that core 0 here wants to do 3:50 a load from address A. It issues a load operation to address A, 3:57 main memory responds and says the value at address A is 2, and so core 0 now caches that value 4:05 with some tag information associated with address A.
4:11 If core 0 gets a future request for address A, it can respond from the cache. Now core 2 comes along, 4:16 and it too wants to do something with address A: it's going to do a store to address A, and 4:22 specifically it's going to store the value 3 into address A. So it too is going 4:28 to bring address A into its cache, but now it's going to update it to 3. And now when we go back to core 0, which 4:36 thinks it already has address A in its cache, and it goes to do another load, it's going 4:43 to read the value 2, which is the stale value, when it should read the updated value 4:48 that was written by core 2, which is 3. Okay, so we have 4:55 to figure out how to avoid this stale data. The way that 5:01 we do that is with what's called a cache coherence protocol. Cache 5:06 coherence protocols are basically going to ensure that we don't have stale lines, and the most common way of doing this is 5:15 through what we call invalidate mechanisms (we'll get into more detail on this in a second). What 5:23 happens is that before you allow a write to any particular address, you first have to invalidate the caches that also hold 5:32 that address. So here you would first invalidate cache 0's copy of address A, 5:38 then you would be able to write to core 2's copy of address A, and you wouldn't have a 5:44 problem when core 0 goes back to read it, because now it would actually 5:49 go and fetch it from main memory, since it's no longer in its cache. All right, so there are 5:55 a few things that we have to make sure of in order to handle cache coherence. 6:01 First of all, we have to enforce two rules. The first is write 6:07 propagation, and what write propagation says is that when I do a 6:12 write, eventually that write has to become visible to all of the processors. And 6:18 the second thing that we have to ensure is write serialization, and what this 6:23 means is that if I'm doing multiple writes to the same address, then all of the processors have to see those writes 6:30 in exactly the same order. Okay, so how do we go about ensuring these two 6:37 things? First of all, for write propagation we have two mechanisms, the 6:42 more common one being the first, which is what we just mentioned: write-invalidate protocols. What this 6:48 means is that when somebody wants to do a write, the first thing you do is invalidate the copy of that same 6:56 address in all of the other caches, and then the processor has its own copy of 7:01 that address and is able to do the write. An alternative to this is 7:07 what's called a write-update protocol. In this case, instead of invalidating the other caches, the cache that 7:14 performed the most recent write has to make sure that it 7:19 propagates its information, so that the other caches can update themselves to the most recent value as well. 7:26 But we're going to be focusing on the first one, write-invalidate, which is a lot simpler to implement. To ensure 7:34 write serialization, we use one of two protocols. The more basic one is a 7:41 snoopy-based protocol, and we're going to spend most of today's lecture on that, but a more efficient and more scalable 7:49 one is a directory-based protocol, which we'll just touch upon at the end of 7:55 today's lecture.
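To make the write-invalidate idea concrete, here is a minimal C sketch. All the names here (Cache, bus_invalidate, store) are invented for illustration; real hardware does this in the cache controllers, not in software. The key point is the ordering: invalidate everyone else first, then write.

```c
#include <stdbool.h>

#define NUM_CORES 4
#define MEM_WORDS 1024

/* One-line-per-core toy model of private caches over a shared memory. */
typedef struct {
    unsigned addr;   /* which address this cache currently holds */
    int      value;
    bool     valid;
} Cache;

static Cache caches[NUM_CORES];
static int   memory[MEM_WORDS];

/* The broadcast step of a write-invalidate protocol: every other cache
 * holding this address must drop its (about-to-be-stale) copy. */
static void bus_invalidate(int writer, unsigned addr) {
    for (int c = 0; c < NUM_CORES; c++)
        if (c != writer && caches[c].valid && caches[c].addr == addr)
            caches[c].valid = false;
}

/* A store first invalidates all other copies, then performs the write.
 * (Write-through here, to match the first protocol in the lecture.) */
static void store(int core, unsigned addr, int value) {
    bus_invalidate(core, addr);
    caches[core] = (Cache){ .addr = addr, .value = value, .valid = true };
    memory[addr] = value;
}
```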
All right, so let's start with snoopy caches. How do snoopy caches work? 8:01 The idea is that I have these multiple processors running in parallel, and each one of them has its own private cache, 8:07 which we call a snoopy cache. The caches are all 8:14 connected through a shared bus, and from there they can also connect to 8:19 main memory. Now, in the past, when we dealt with our caches, the processor was 8:24 the one making requests, either loads or stores, and the cache was responding: if it had the value, it would 8:31 have a hit and return the data; if it had a miss, it would go to main memory and fetch the data from there. In 8:38 this case, things get a little more complicated, because caches now actually have to listen 8:43 to two sources: they have to listen to the requests coming from the processor, but they also have to listen 8:50 to what's going on on the bus, and they need to respond accordingly 8:56 based on both of those. So how do we 9:01 go about doing this? The first thing is we have to make sure that all of our 9:08 memory requests and our broadcasts on this bus are in order, so that everybody 9:14 sees all loads and all stores in exactly the same order. Then what happens is 9:19 that each cache controller is going to snoop (which is where they get their name) in order to see what's 9:27 going on on the bus, not only for its own transactions but also for what everybody else is doing, and based on 9:33 whatever is going on on the bus, it's going to potentially change the state of its cache lines. So basically what we're 9:42 going to do is implement a finite state machine which defines what 9:47 exactly each cache line needs to do, depending on its current state and what's going on in either the processor 9:55 requests or the shared bus. All right, so we're going to start off with a pretty 10:01 simple protocol: just valid/invalid (VI). Just like in our original 10:08 implementation of caches, we had the concept of a valid bit, where the valid bit told you whether the 10:14 cache line that you have is valid, or whether you just started up and it might be garbage 10:21 in there. In cache coherence, we want to use the valid bit to indicate a 10:28 little bit more, and the way we're going to use the valid 10:33 bit for cache coherence is as follows. It's going to follow the FSM that's described here, where we have two 10:40 states, valid and invalid. You start off in the invalid state, and suppose 10:45 that the processor issues a read request, or a load. If it issues a read 10:52 request, then I'm going to fetch something from main memory. The way I do that is by 10:58 initiating a bus read request, and the bus read is going to go talk to main 11:03 memory, get the data it's requesting, and bring it into the cache. When I bring it into 11:10 the cache, my state is now valid, so I've moved to this valid state over here. Once I'm in the valid state, that 11:17 means I've brought the value in from memory, so if I now get another read request, I can respond directly from my 11:25 cache. That's what this feedback arrow here represents: if 11:31 we get another read request, we don't have to initiate any additional bus 11:36 transactions; we can just respond from our cache. However, if we get a write 11:41 request, then we have to initiate what's called a bus write transaction. The 11:48 idea is that in this implementation, the caches we're using are write-through caches. Can anybody tell me if 11:56 they remember what write-through caches do? Yep, 12:07 correct: you're always writing to memory as well. So basically what's 12:12 happening here is that even when I'm in the valid state, if I'm 12:18 doing a write, I am going to initiate a bus write, so that my write is 12:24 also written to main memory. Okay, now what does this dashed line here mean? 12:30 It means: suppose that I'm another cache, and I see on the bus a bus 12:36 write request initiated by somebody else; what do I need to do? 12:42 If I was in the valid state and someone else is now trying to write a particular 12:49 location that I also have in my cache, I need to pay attention to this bus 12:54 write action and become invalid, so that the only cache that has a valid copy of 13:04 that particular address is the one that's actually doing the writing. Any questions about that? Great. Then 13:15 finally, in this particular protocol, because it's write-through, if we're in 13:21 the invalid state and we get a processor write request, then we 13:27 simply write directly to memory and stay in the invalid state: we initiate this bus write action, which 13:35 writes directly to memory. Okay, so far not too complicated.
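The VI state machine just described can be written down as a small transition function. This is a sketch, not hardware: the event names (PR_READ, PR_WRITE, BUS_WRITE_SEEN) are invented for this example, and the output flags stand in for the bus transactions the controller would issue.

```c
typedef enum { INVALID, VALID } VIState;

/* Events a cache line can see: from its own processor, or snooped
 * off the shared bus (event names invented for this sketch). */
typedef enum { PR_READ, PR_WRITE, BUS_WRITE_SEEN } VIEvent;

/* One step of the VI (valid/invalid, write-through) protocol.
 * The two flags report which bus transaction, if any, must be issued. */
VIState vi_next(VIState s, VIEvent e, int *bus_read, int *bus_write) {
    *bus_read = *bus_write = 0;
    switch (s) {
    case INVALID:
        if (e == PR_READ)  { *bus_read  = 1; return VALID;   } /* fetch from memory   */
        if (e == PR_WRITE) { *bus_write = 1; return INVALID; } /* write through, stay */
        return INVALID;
    case VALID:
        if (e == PR_READ)  return VALID;                       /* hit, no bus traffic */
        if (e == PR_WRITE) { *bus_write = 1; return VALID; }   /* still write through */
        if (e == BUS_WRITE_SEEN) return INVALID;               /* someone else wrote  */
        return VALID;
    }
    return s;
}
```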
13:41 So let's work through a simple example just to make sure that all these 13:46 transactions make sense. Imagine a situation where we just have two cores, core 0 and core 1; 13:53 each one of them has its own private cache, and we have this shared main memory. Imagine now that core 0 starts 14:01 off by doing a load of address A. So what is that going to do? Based on our FSM, 14:08 that's going to initiate a bus read request for address A, and the memory sees the 14:14 bus read request and is going to respond with what is in address A. 14:20 That response comes back to the cache of core 0, which changes its state to the valid state 14:27 and records that, for the entry associated with address A, the 14:32 value is 2. So now we proceed, and now core 1 also wants to load address A. 14:40 In this case, once again, it's going to initiate a bus read request, and it too 14:47 is going to get a response, so it changes its state to valid and 14:53 also updates its cache. Now, if I get any further read requests from either 14:58 core 0 or core 1, I no longer need to initiate any more bus transactions, 15:03 because I now have a valid copy in my cache; any further responses can come directly from 15:10 the cache. But now suppose that the next 15:16 thing that happens is that core 0 wants to execute a store to address A. We 15:23 said that when we execute a store, we need the other caches to 15:29 invalidate their copies. The way this is going to happen is that the 15:34 store initiates a bus write action on our bus, and that's going to 15:41 take this value 3, which we're trying to write into address A, and update main memory with it. But in addition, it's 15:49 going to go to core 1 and have core 1 invalidate its entry for address A in 15:55 its cache. Once it invalidates its entry (and if you had 16:00 more cores, once all of them invalidated their entries), core 0 is the only one that actually has a copy 16:08 of address A that's valid in its cache, and so now it's able to go ahead and 16:14 update it to 3, and everything will continue to be consistent across the 16:19 entire system. So now, if core 1 does another load of A, 16:27 because its value was made invalid, it's going to issue another bus read, and it's going to get a 16:34 response that says: okay, the new value of address A is now 3. 16:40 [In response to a question:] Yes, because it's a write-through cache, 16:48 you're always writing to memory anyway, so there's no particular need 16:54 to bring it into your cache. If you were going to then use it for anything else, you could also do another read, 17:01 which would then bring the updated value into the cache. 17:10 In a write-through cache, you could do it both ways.
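Assuming the vi_next sketch from above, the walkthrough we just did can be replayed as a short driver (purely illustrative; the value 3 lives in the modeled memory, not in these states):

```c
#include <stdio.h>

/* Assumes the VIState/VIEvent/vi_next definitions from the sketch above. */
int main(void) {
    int rd, wr;
    VIState c0 = INVALID, c1 = INVALID;

    c0 = vi_next(c0, PR_READ,  &rd, &wr);   /* core 0: LD A -> VALID, BusRd */
    c1 = vi_next(c1, PR_READ,  &rd, &wr);   /* core 1: LD A -> VALID, BusRd */
    c0 = vi_next(c0, PR_WRITE, &rd, &wr);   /* core 0: ST A=3 -> BusWr      */
    if (wr)                                 /* core 1 snoops the bus write  */
        c1 = vi_next(c1, BUS_WRITE_SEEN, &rd, &wr);  /* -> INVALID          */
    c1 = vi_next(c1, PR_READ,  &rd, &wr);   /* core 1: LD A refetches the 3 */

    printf("core0=%d core1=%d (1 = VALID)\n", c0, c1);
    return 0;
}
```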
17:19 Okay, so who sees some problems with this simple VI protocol? Is it great? Should we stop here 17:29 and go home, or can we do a little better? [A student answers.] Yeah, well, that's true, but we're not 17:46 actually updating; we're invalidating, which is a little bit faster than actually updating. But the key thing to notice 17:53 here is what has to happen every time we do a write: 18:00 I need to update my memory, and I need to 18:05 initiate a bus transaction. So regardless of whether I 18:14 have the data in my cache or not, I'm always writing back to main memory, which is costly, because I 18:20 don't necessarily have to write every single one of those values back. And in addition, every single write requires a 18:28 broadcast on this bus, which is an expensive operation. So we want to try to improve on that. 18:37 Before we get to our next protocol, let's talk a little bit about what we 18:43 need to be true in order for our cache coherence to work properly. We 18:48 said that all of the loads and stores that are seen in our entire system have 18:54 to have a global ordering, meaning that everybody sees them in exactly the same order. Now, if we're not careful and 19:01 multiple caches have copies of the same address, you can imagine getting yourself into trouble where 19:08 that's not actually what's happening. So in order to ensure that we do have this 19:13 global ordering, we can do the following. The first is that we only allow one 19:18 cache at a time to have write permission to an address. Just like we saw in the 19:24 simple valid/invalid protocol, when one of the caches 19:29 wanted to write to a location, the other one, if it had a valid copy, had to become invalid. And the 19:37 second thing is that at no point can any cache hold a stale copy of the data: 19:43 it has already gone to the invalid state, so it will no longer consider the old value to be 19:52 the correct answer. All right, so we're now going to introduce a new protocol, 19:57 which is a slightly modified version of valid/invalid, and this is called MSI, 20:03 which stands for modified, shared, invalid. The invalid state basically means 20:11 the same thing, which is that you don't have a copy of that address in your cache. The 20:17 shared state indicates that your cache has a particular address 20:25 in it, but you are only allowed to execute reads to it, because some other caches may have that address as well. And 20:33 finally, the modified state means that only this cache has the address, and so 20:39 it's safe to do both reads and writes if you're in the modified state.
20:45 All right, so let's take a look at what the FSM for the MSI protocol looks like. 20:52 First of all, let's remind ourselves what we said the drawbacks of the VI mechanism were. The drawbacks were that 20:59 every update also wrote to main memory, and every update caused a 21:04 broadcast on our bus. MSI is going to solve both of those issues. 21:10 First of all, it's going to allow us to have write-back caches; it doesn't have to be a write-through 21:15 cache. And some of our writes are going to be 21:21 satisfied locally, without having to initiate any bus transactions. So let's 21:26 take a look at this. Now, the state diagram might look a little scarier, so let's go through what it's telling 21:33 us. First of all, we have our three states, M, S, and I. Being toward the bottom means 21:41 we essentially have fewer permissions, and as we move up in the states we have 21:46 more permissions, so modified is the state that gives us the most permissions, allowing us to actually 21:52 write to a particular location. On the left, what we show in orange 22:00 arrows are the processor-initiated requests. A processor-initiated 22:06 request can cause some bus transactions, which is what's shown after 22:12 the slash, but everything on this side of our FSM is 22:18 initiated by a processor request. Now, you'll notice that when we have 22:24 processor requests, what's basically happening is that the processor is getting us more permission for a 22:31 particular location, and so we're moving up in this FSM from a lower state to 22:38 a higher state. On the right-hand side, these dashed blue lines are bus-initiated 22:44 transactions. This means that when some other cache has 22:51 responded to some processor request and as a result has broadcast 22:59 something onto the bus, all the other caches that see that information on the bus need to respond 23:05 accordingly. Okay, so let's take a little bit of a closer look at what's actually 23:11 happening. We'll start off in the invalid state, and let's say that I 23:16 want to read something. My processor issues a processor read, and 23:22 that causes a bus read request, in order to tell everybody else that I'm 23:30 reading this particular location, and now I've moved into the shared state. I am 23:37 able to read the location, but not modify it, while I'm in the shared state. Now, if instead of just doing a read I 23:46 want to actually do a write, then I'm going to go all the way up to the modified state. But in order to do that I have to 23:53 make a stronger statement on my bus, which is called a bus read 23:58 exclusive (BusRdX). This is basically saying: I want to get an exclusive copy of this location into my cache. So what 24:07 happens if one cache issues a bus read exclusive? All the other caches, 24:12 which are listening, whether they're in the shared state or the modified state, if they see a bus read exclusive, are 24:19 going to put themselves back into the invalid state. We see that 24:24 along this arrow from modified to invalid, as well as here from shared to invalid. 24:32 All right, so now imagine that I made it to the modified state. 24:39 That means I'm the only cache that has a copy of this address, and so I'm free to do what I want with it: I 24:46 can read it, I can write it, and you'll see that the reads and the writes in this case don't require any 24:53 bus transactions. This is where MSI is better than the VI protocol, 24:59 because I'm now saving on bus transactions, which are costly. And 25:07 finally, there are a few more transitions: if you're in the shared state and you want to write, then 25:13 you do a similar thing to what you did from invalid, which is issue a 25:19 processor write, which causes a bus read exclusive; that takes you to the 25:24 modified state, but it makes any other cache that has that address invalidate itself.
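Here is the MSI diagram as a transition function, in the same sketch style as the VI one. The BusRd/BusRdX/write-back names follow the lecture; the enum and flag encodings are my own invention.

```c
typedef enum { I, S, M } MSIState;
typedef enum { PR_RD, PR_WR,               /* from this line's processor */
               BUS_RD, BUS_RDX } MSIEvent; /* snooped off the bus        */
typedef enum { NONE, BUS_READ, BUS_READ_X, BUS_WRITEBACK } BusAction;

/* One MSI step: returns the next state and reports the bus transaction
 * (if any) this cache must issue in response. */
MSIState msi_next(MSIState s, MSIEvent e, BusAction *act) {
    *act = NONE;
    switch (s) {
    case I:
        if (e == PR_RD) { *act = BUS_READ;   return S; }  /* fetch, read-only   */
        if (e == PR_WR) { *act = BUS_READ_X; return M; }  /* claim exclusively  */
        return I;
    case S:
        if (e == PR_RD)   return S;                       /* local hit          */
        if (e == PR_WR) { *act = BUS_READ_X; return M; }  /* upgrade to writer  */
        if (e == BUS_RDX) return I;                       /* another core wrote */
        return S;
    case M:
        if (e == PR_RD || e == PR_WR) return M;           /* all local, no bus  */
        if (e == BUS_RD)  { *act = BUS_WRITEBACK; return S; } /* share dirty data  */
        if (e == BUS_RDX) { *act = BUS_WRITEBACK; return I; } /* hand off and drop */
        return M;
    }
    return s;
}
```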
25:31 So let's work through an example just to make sure it all makes sense. We're 25:38 going to follow pretty much the same example that we had before, where we have two cores, each of them with its own 25:44 private cache, and we have this shared memory. We start off with core 0 wanting to do a load of address A. 25:53 The first thing that happens is we issue a bus read request for address A, 25:58 and main memory is going to respond, so now we put ourselves into 26:04 the shared state with the value of address A, which is 2. Now core 1 26:10 comes along and it also wants to load address A, so it's going to initiate another bus read request. Once again 26:18 the memory is going to see it and respond, and core 1 too moves into the shared state with a 26:25 value of 2 for address A. All right, so once 26:30 both of these cores have loaded the value, just like in the VI protocol, they're now 26:36 able to respond to loads directly from 26:41 the cache; they don't need to initiate further bus reads. But let's take a 26:47 look at what happens when we do stores. Now core 0 wants to do a store 26:52 to address A. In order to do a store, we're going to have to make an exclusive read request: we 27:00 initiate this bus read exclusive for address A. And what does that do? Well, 27:08 that's basically going to tell core 1 that it needs to invalidate its entry for 27:16 address A, and once it invalidates it, core 0 can move to the modified 27:22 state and go ahead and update that location. Now, you'll notice that at this 27:27 point we've updated the location in the cache; we have not updated main memory, 27:32 because this is a write-back cache, not a write-through cache. So now, if 27:41 core 0 gets either loads or stores, it's able to handle both of them 27:46 locally from its own cache. But now suppose that core 1 comes along, and 27:55 it too wants to do a store to address A. So now what do we have to do? 28:06 Anybody? Yeah: we're going to issue another 28:12 bus read exclusive. Core 0 is going to see this bus read exclusive, and it's 28:17 going to invalidate itself. But before invalidating itself, because it was in 28:23 the modified state, in this case it initiates a bus write-back. This 28:29 write-back takes the value that was modified in the cache of core 28:35 0 and writes it back to memory, so the memory is going to have this updated value of 3 written back to 28:42 it. Once that happens, core 0 becomes invalid, and now core 1 is 28:49 able to go ahead and modify its value, so it moves to the modified state 28:56 and updates its local copy of the cache to the new value for A. 29:11 [In response to a question:] Well, if you were doing a load, you would also have to do a write-back. The 29:17 idea is basically that any time your line is dirty and you're taking it out of your cache, you need to at least send it back; 29:25 that's your way of handing off the latest data 29:30 that you put in there. So if core 1 were just doing a load, what would happen would be the same: 29:35 core 0 does a write-back of the value 3 for 29:42 address A, and now core 1 would read that and get 3 as its value, 29:47 which is what I want, not what was previously in main memory.
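Assuming the msi_next sketch above, the two stores in this example play out like this (illustrative driver only):

```c
/* Assumes the MSIState/MSIEvent/BusAction/msi_next definitions above. */
int main(void) {
    BusAction a;
    MSIState c0 = I, c1 = I;

    c0 = msi_next(c0, PR_RD, &a);    /* core 0: LD A -> S (BusRd)            */
    c1 = msi_next(c1, PR_RD, &a);    /* core 1: LD A -> S (BusRd)            */

    c0 = msi_next(c0, PR_WR, &a);    /* core 0: ST A=3 -> M (BusRdX)         */
    c1 = msi_next(c1, BUS_RDX, &a);  /* core 1 snoops it -> I                */

    c1 = msi_next(c1, PR_WR, &a);    /* core 1: ST A -> M (BusRdX)           */
    c0 = msi_next(c0, BUS_RDX, &a);  /* core 0 snoops it: a == BUS_WRITEBACK */
                                     /* (dirty 3 goes to memory), then -> I  */
    return 0;
}
```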
30:03 [In response to a question:] Well, you want to make sure that you don't have stale copies in your cache, 30:09 right? We said that basically we have to have this global vision of 30:16 all the loads and stores, and so we want to make sure that everything 30:22 actually appears on the bus, even though, yes, in this particular case it's going to then get overwritten. All right, so one 30:33 thing to note here is that in the case of MSI, your cache can 30:41 actually be responding as well as your main memory. So you have a little bit of a race condition, and you need to 30:48 make sure that if both of them have different answers, the one that's actually listened to is the one 30:55 from the cache. Typically that's not so hard to implement, because the caches respond more quickly, but that's 31:02 just something to keep in mind. All right, so finally, one last example for this MSI 31:10 protocol: now we want to do a final load. We've got core 1 31:17 in the modified state, and we want core 0 to now do a load. So what's going to 31:23 happen in this situation? We can't have core 0 do the load 31:29 while core 1 is in the modified state. So we're going to issue a bus read 31:34 request; core 1 is going to see that bus read request, and it's going to say: okay, I 31:40 need to get out of my modified state. The way that it does that is it issues a bus write-back in order to 31:46 write its value back and remove it from its cache. So it writes the 10 back to 31:53 main memory; after it does that, it can go to the shared state, and now core 0 32:00 can also go to the shared state, and both of them will have the same value, which is the most recent value, 32:07 which core 1 had previously written. Any questions about that? 32:16 Right, so we can make a slight 32:23 modification on MSI and make things a little bit better if we 32:28 notice that we have many situations where, even if I'm not sharing data, I'm 32:35 going to be going through this read-modify-write sequence of events. If you look back at your FSM for MSI, 32:43 what happens when I need to do a read-modify-write? How many bus transactions 32:49 am I going to need? I'll need one for 32:56 my read and another one for the write. So basically, even if I'm 33:03 working on my own private data that nobody else has in their cache, I'm actually causing two bus transactions, 33:10 which are pretty costly. We can solve that by adding one more 33:15 state, which is called the exclusive state. The idea here is that 33:20 instead of going into the shared state when you do a read, if nobody 33:27 else has a copy of that address, then you go into the exclusive state. By 33:33 being in the exclusive state, you're basically giving yourself the opportunity to save one bus transaction 33:39 when you do a write, since you're the only one who actually has that address in their cache. 33:44 So here's the modified FSM for 33:52 this new four-state machine, which we call MESI, for modified, exclusive, 33:59 shared, invalid. It may look scary, but it's 34:04 pretty similar to what we had before; we're only adding a couple of additional arrows plus this exclusive state. 34:11 Essentially, the things that are changing are: if I'm in the invalid state and I 34:17 get a processor read of a particular location, then if that read is 34:24 shared, meaning other caches have that address as well, I go to the shared 34:30 state; but if nobody else has that address, then I can go to this exclusive state. Once I'm in the 34:37 exclusive state, reads I can respond to right away, but more 34:44 importantly, if I then do a write, I can go from exclusive to modified by 34:50 just issuing a processor write, without any additional bus requests. 34:55 Any questions about that? All right, 35:02 good.
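Only a couple of transitions actually change relative to MSI, and they can be sketched on their own. Here, others_have_copy stands in for the "shared" signal a real bus provides during the read, and BusAction is reused from the MSI sketch; all of it is illustrative.

```c
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MESIState;

/* On a read miss, the bus tells us whether anyone else holds the line;
 * MESI uses that "shared" signal to pick S or E (sketch only). */
MESIState mesi_read_miss(int others_have_copy) {
    return others_have_copy ? MESI_S : MESI_E;   /* I -> S, or I -> E */
}

/* The payoff: a write from E upgrades to M silently, with no bus
 * transaction. (Snooped transitions such as E -> S on a BusRd are
 * omitted here; they match MSI.) */
MESIState mesi_write(MESIState s, BusAction *act) {
    *act = NONE;
    if (s == MESI_E || s == MESI_M) return MESI_M;  /* no broadcast needed */
    *act = BUS_READ_X;                              /* from I or S, still  */
    return MESI_M;                                  /* announce on the bus */
}
```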
So that's it for snoopy-based caches. As we add more 35:10 and more parallel processors, snoopy-based caches stop scaling: basically, it becomes too 35:18 cost-prohibitive to be sending these broadcast messages to every single one of our processors. So as your 35:24 systems get larger, what we use is directory-based cache coherence. 35:30 We're not going to get into the details of how this works, but the main idea is as follows: you have a 35:38 whole bunch of processors, each of them has its own cache, and you take your entire memory and you divide 35:45 it up amongst all of those processors, so each one of the processors is 35:50 responsible for a portion of the memory. And in addition to being responsible for 35:56 that portion of the memory, they are also responsible for maintaining a directory, which has all the information 36:02 relevant to what's going on in that memory. In that directory, 36:07 any time some other processor makes a request for a particular address, it's going to keep track of it, 36:14 and it's going to know who has which copy, and whether they're in the modified, 36:20 shared, exclusive, or whatever state. Now, when it needs to make the caches 36:26 change something about their state, instead of sending a broadcast on a big shared bus 36:32 to everybody, it sends an individual message to the particular processor that it's relevant for. 36:39 So we're reducing the amount of communication by moving to this model, 36:44 and you can certainly read up more about it to get more detail.
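A minimal sketch of what a directory entry might look like at a line's home node (the field names, states, and 32-core limit are arbitrary choices for illustration, not a real design):

```c
#include <stdint.h>

/* One directory entry per memory line, kept at the line's home node. */
typedef struct {
    enum { UNCACHED, SHARED_BY_MANY, MODIFIED_ELSEWHERE } state;
    uint32_t sharers;    /* bit i set => core i holds a copy (<= 32 cores) */
} DirEntry;

/* On a write request, the home node messages only the caches that
 * actually hold the line, instead of broadcasting on a shared bus. */
void dir_handle_write(DirEntry *e, int writer,
                      void (*send_invalidate)(int core)) {
    for (int c = 0; c < 32; c++)
        if (((e->sharers >> c) & 1) && c != writer)
            send_invalidate(c);          /* point-to-point, not broadcast */
    e->sharers = 1u << writer;           /* writer is now the only holder */
    e->state   = MODIFIED_ELSEWHERE;
}
```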
36:52 Now, before we finish, there's just one last thing that I wanted to mention, which is that as you're 36:58 dealing with these cache coherence issues, you're going to find that 37:04 different situations (and we're going to look into one example here) can lead you 37:10 to bad performance in your cache unnecessarily. So let's take a look 37:16 at what I mean by that through this example. Suppose that you have a cache line. Until now, in 37:23 today's lecture, we were just talking about a cache line that has a single word in it, but when we studied 37:29 caches we learned that in reality your cache lines are generally blocks of 37:34 multiple words. So imagine that my line actually has N+1 37:42 words, and now imagine that one cache is dealing with one word in my 37:49 line while another cache is dealing with another word in my line. Because these two words happen to live in the 37:57 same line, any time one processor wants to write to the location 38:06 that it is trying to modify (let's say it's word I, and the other processor is 38:11 trying to modify word K), even though they're not actually interfering with each other (in other words, they're not 38:17 trying to modify the same address at all), what's going to have 38:22 to happen is that they each need to get the entire line into the modified state. So 38:28 basically, the first cache is going to request to modify that line, which is 38:34 going to force the other one to invalidate itself, and then it's going to update word I. Then the next 38:40 one is going to come along, and it's going to want to update word K, so it too 38:45 is going to request the modified state, which is going to make the first processor invalidate itself, so that it can update 38:52 the line, even though there's actually no conflict between these two writes. So the 38:59 point that I'm trying to make here is that if we're not careful, there can be a lot of ping-ponging amongst our 39:08 different caches in terms of who is requesting the ability to write, 39:14 and sometimes it's not for a real reason. In this example, you could 39:24 simply move where your words are located, such that they don't end up in the same 39:30 cache line, and by doing that you would avoid this ping-ponging problem. And just like it occurs 39:37 here, it occurs in other situations as well, so we want to always be 39:43 careful to think about what we need to do if there is a possibility of this ping-ponging, which is going to 39:50 reduce the performance of our caches. All right, so that's it for today. As I 39:59 mentioned, next time is a fun lecture about putting it all together. In case we 40:05 don't see you next time: it's been a great semester, thank you, and good 40:11 luck finishing up all your work for tomorrow, and we'll see you at the quiz. Thanks.
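This ping-ponging is usually called false sharing, and it's easy to provoke in ordinary C code. In the sketch below, two threads increment two different counters that happen to live in the same struct, so they fight over one cache line; the padded apart layout (assuming a typical 64-byte line, which the C standard does not guarantee) puts each counter on its own line and removes the contention.

```c
#include <pthread.h>
#include <stdio.h>

#define N 100000000L

/* Two counters in one struct: they almost certainly share a cache line,
 * so the two threads below ping-pong the line between their caches. */
struct { long a, b; } together;

/* Same counters padded apart (assuming a typical 64-byte line; the C
 * standard guarantees nothing here). Swap this in to see the speedup. */
struct { long a; char pad[64]; long b; } apart;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) together.a++;   /* writes word "I"... */
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) together.b++;   /* ...and word "K"    */
    return NULL;
}

int main(void) {                     /* build with: cc -O2 -pthread fs.c */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", together.a, together.b);
    return 0;
}
```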