The goal of this lecture is to first give you a very broad, high-level introduction to how modern processors work, so that when you write programs and try to optimize their performance, you understand at a high level what mechanisms are going on and how the processor is trying to help you improve performance. Then, in the second part of the lecture, we're going to focus on branch prediction, and here we'll discuss a technique that is going to be quite useful for your design project.

So, just as a reminder, this is the iron law of processor performance, where we express performance as the product of three different terms: the instructions that the program takes, the cycles per instruction (CPI) that the processor takes to execute the instructions in the program, and the time per cycle, or clock cycle time. You can improve any of these three, and that will improve the performance of your system. We're going to focus, as before, on the last two terms: CPI and clock cycle time. Remember that when we introduced pipelining, we said pipelining is a good technique to reduce clock cycle time and therefore improve clock frequency; that's the reason you have processors today that can run at 3 or 4 gigahertz — they have 10, 20, 30 pipeline stages. But the other component here is CPI, the number of cycles per instruction that our processor takes on average in a particular program.

Let's be a little more systematic about how we try to improve CPI by first dividing it into two terms. I'm going to say that the cycles per instruction can be divided into two components. The first is what we call CPI ideal: the cycles per instruction that your machine would achieve if there were no stalls and no mispredictions — if you could push the maximum number of instructions through the pipeline every cycle. The second component we're going to call CPI hazard: the contribution of data and control hazards to the cycles per instruction. You can see hazards — whether from memory accesses or from operations that have dependencies on each other's values — as adding cycles to each instruction you're executing, and on average this contributes to the CPI. Control hazards count too: branches, jumps, and also exceptions — remember, an exception is essentially an implicit branch on every instruction — add extra cycles that we need to handle.

Okay, so we started by looking at the classic five-stage pipeline, with fetch, decode, execute, memory access, and write back. This is a very simple design — you're basically starting from something very similar, even a little simpler, in the design project — and it is still used when you don't need super high performance. Many chips used in embedded devices and in things like network routers and Wi-Fi access points will have, for example, low-power ARM processors that implement precisely this pipeline.
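A quick numeric sketch of the iron law and the CPI split described above; every number below is an illustrative assumption, not a figure from the lecture.

```python
# Iron law: time/program = (instructions/program) x (cycles/instruction) x (time/cycle),
# with CPI further split into an ideal component and a hazard component.
instructions = 1_000_000      # assumed instruction count for some program
cpi_ideal    = 1.0            # single-issue pipeline: at best one instruction per cycle
cpi_hazard   = 0.5            # assumed extra cycles per instruction from data/control hazards
cycle_ns     = 0.5            # assumed 2 GHz clock

cpi = cpi_ideal + cpi_hazard
time_ms = instructions * cpi * cycle_ns / 1e6
print(cpi, time_ms)           # improving any factor of the product improves execution time
```

Shrinking any one of the three factors shrinks the product, which is the point the lecture develops next: pipelining attacks clock cycle time, and the remaining techniques attack CPI.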
But this pipeline has some problems. Even though the ideal CPI is one — well, one is the upper bound on performance here, and we can actually do better, as we'll discuss later — when you have a high-latency instruction, for example a cache miss in the memory stage, the problem is that everything behind the instruction that causes the stall also stalls. A single stall propagates immediately through the pipeline. And of course you have long operations: for example, you have a multiplier that you'd like to pipeline and that takes three, four, or five cycles, and now you need to dramatically increase the number of stages in your design. Just five stages often doesn't give you enough time to do complex operations.

So there are essentially four things we can do to improve the performance of this pipeline. Two are conceptually fairly simple. We can keep pipelining — keep dividing our pipeline into more and more stages to increase the clock frequency. Another conceptually simple, though mechanically quite complicated, optimization is what we call wider pipelines, where instead of executing one instruction per cycle we try to fetch, decode, execute, and so on, multiple instructions every cycle. The problem is that if we just apply these two techniques, performance is not going to be good: yes, we're reducing the clock cycle time and getting great frequency, and we're reducing the ideal CPI, but we still have to deal with hazards. So there are two more techniques used to deal with hazards. Data hazards are typically handled by what we call out-of-order execution, where instead of executing instructions in the order they appear in the program, the processor tries to execute instructions as soon as their source operands are available. The second technique, which we will see in more detail, is reducing the impact of control hazards through what we call branch prediction — really, prediction of branches and jumps — where we try to make better predictions than just PC + 4 when figuring out the next PC that the pipeline should fetch.

I'll explain the first three techniques fairly quickly, and this is purely for demystification purposes — it's not going to be on the quiz. It's just so that when you have a program running on a modern processor and you see that the code has a certain data dependence, yet the processor is somehow overlapping things in ways you don't expect, you understand that it's because of some of these techniques.

Okay, so deeper pipelines. The key idea here is: if five stages were good, then ten must be better, and fifteen better still. Instead of having five stages, we keep breaking up the work into more and more stages. Instead of a single cycle for fetching, we can take two cycles, which allows us to clock the memory faster or to use larger memories; we can break decoding into decode and register read; and so on. This is good because, to a large degree, it's the main weapon that modern processors use to run at multiple gigahertz.
For example, designs in the early 2000s had around eleven or twelve stages. In the mid-2000s there was a period when the main way to sell performance was frequency, so even if processors didn't do much work per clock cycle, processor designers got into a race over how deeply they could pipeline, so they could sell you processors that run at four gigahertz instead of three — because people want a single number to summarize performance. That's why you have designs like the Pentium 4, with 34 pipeline stages, which is a little on the high end. These days you'll see designs with sixteen to twenty pipeline stages.

There are some costs, of course. The obvious one is that you have more pipeline registers, but you also have more bypass paths. Remember that if you have sequences of instructions with dependences, you want to be able to bypass a source operand that might still be further down in these stages, right before it is consumed. In this case, at the output of the read-register stage you'd need a bypass path from here, from here, from here, and also from here — you need to add more and more bypass paths. Remember, muxes are expensive and wires are also expensive, so this doesn't scale very well.

All in all, the more you overlap, the more the dependencies among nearby instructions are going to hurt you in terms of cycles. Things that used to be cheap get more expensive: if before every mispredicted branch cost us a penalty of two cycles, or two instructions, now our cycles might be very fast, but if we're resolving branches in stage five we're going to incur a penalty of five instructions — we throw away five instructions on every misprediction. There's also the issue that at some point clocking overhead, things like the setup time of the registers, becomes dominant: if you're spending more time on register setup than on the actual logic of each pipeline stage, you don't get a lot of performance benefit. And finally, very high clock frequencies increase power consumption by a lot, so this also runs into limitations. Okay, any questions? All right.
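A rough back-of-the-envelope model of the depth trade-off just described — splitting a fixed amount of logic across more stages while paying a per-stage register/clocking overhead, and resolving branches deeper in the pipeline. All of the numbers are made-up assumptions for illustration.

```python
LOGIC_DELAY_PS = 2000    # assumed total combinational delay for one instruction's work
OVERHEAD_PS    = 100     # assumed per-stage register setup / clocking overhead

for stages in (5, 10, 20, 40):
    period_ps = LOGIC_DELAY_PS / stages + OVERHEAD_PS
    freq_ghz = 1000.0 / period_ps
    resolve_stage = int(stages * 0.75)   # assume branches resolve ~3/4 of the way down
    print(stages, round(freq_ghz, 2), resolve_stage)
# frequency gains flatten out as the per-stage overhead starts to dominate,
# while the number of instructions thrown away per misprediction keeps growing
```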
The other broad technique is, instead of pushing for a very high clock frequency, to have a wider pipeline, where every stage processes multiple instructions per cycle. For example, in the pipeline shown here I'm showing a width of four: each pipeline register has four elements, which represent four different instructions. The idea is that you have a very wide instruction cache or instruction memory, and instead of giving you 32 bits it gives you 128 bits — four instructions, at PC, PC + 4, PC + 8, and PC + 12 — so you fetch a chunk of four instructions, then decode them in parallel, read their registers in parallel, execute them in parallel (you'll need four ALUs), and even when you need to access memory you can do those memory accesses in parallel, and so on.

This addresses the CPI-ideal component I mentioned: if we have a pipeline of width W, we get a CPI ideal of 1/W, because in the best case each instruction takes 1/W cycles — in other words, every cycle we're executing W instructions. Modern processors will routinely try to fetch, decode, and execute three to four instructions per cycle. So if you're writing high-performance code, it helps to write it in a way that avoids very tight dependences, because if you have a dependence between two instructions in the same cycle, at some point the processor is not going to be able to push all those dependent instructions through the pipeline in the same cycle — you want some measure of independent work.

The problem with this technique is that there are a lot of quadratic overheads: you need a wider instruction cache, more ALUs, a register file that in this case has eight read ports instead of two, and that adds a lot of wiring. And because you're trying to execute multiple instructions per cycle, the impact of hazards grows: you're going to have many cases where dependent instructions cannot progress through the pipeline in the same cycle, and you might need to hold off some of the instructions that were fetched or decoded for a few cycles.

This is something you could actually try to do in the design project. It's kind of complicated, but a superscalar pipeline that tries to run a couple of instructions per cycle could be doable. There are simpler techniques, though, and you absolutely do not need to do this to get full points.
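A toy illustration of the point just made about CPI ideal and dependences: a W-wide machine can finish W independent instructions per cycle, but a chain of fully dependent instructions still proceeds one per cycle. This is only a counting model, not how a real issue queue works.

```python
import math

def best_case_cycles(n_instructions, width, fully_dependent):
    if fully_dependent:
        return n_instructions                  # each instruction waits on the previous result
    return math.ceil(n_instructions / width)   # otherwise up to W instructions per cycle

print(best_case_cycles(100, 4, fully_dependent=False))  # 25 cycles -> CPI of 0.25 = 1/W
print(best_case_cycles(100, 4, fully_dependent=True))   # 100 cycles -> width doesn't help
```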
So this brings us to the issue of hazards, and specifically data hazards. We have these three strategies — stall, bypass, speculate. We use stalling and bypassing for data hazards, and either stalling or speculation for control hazards; we said control hazards allow speculation because it's easy to make good guesses, and we'll see in the last part of the lecture that it's actually easy to make really good guesses. But right now I want to focus on data hazards and introduce a fourth strategy, which is essentially to find something else to do. If you have an instruction that depends on some data value that's not yet available — for example, there's a load, and then another instruction that consumes that load — well, you could stall; predicting the value doesn't really work, and bypassing doesn't really work because you're still fetching it from memory. But what if there's some other instruction downstream — not the next instruction in program order, but some instruction later on — that's actually independent, operating on independent data? The way to exploit that is through what we call out-of-order execution. It's not a very good marketing term — you won't see lots of processors advertising "we'll sell you an out-of-order processor"; I think Intel calls this dynamic execution or some other thing — but the technical term is out-of-order execution, because essentially we're trying to execute instructions in what we call dataflow order: as soon as their input operands are available.

To see how this works, consider this expression: we want to compute D = 3 × (a − b) + 7 × a × c. All of a, b, and c are in memory, so you're going to have to load them, then do some computation to produce D, and finally store D to some location in memory. A compiler could produce the sequential code — the sequence of instructions — to compute D: it loads a, then loads b, then subtracts b from a, then multiplies by three to get 3 × (a − b), so it has this part of the expression; then it loads c, because it needs c to produce the second part, multiplies a times c, computes with another multiply 7 × a × c, and finally adds everything up and stores the result. But not every instruction here depends on the results of the previous instructions. So instead we can represent this computation as what we call a dataflow graph. In a dataflow graph, nodes (vertices) represent computations or operations, and edges between the vertices represent the input and output data flowing between these operations. Here you can see that we can load b, a, and c, and then the subtract operation takes b and a as inputs, and so on. By expressing things as this dataflow graph, you can see that not everything is dependent: in fact, if you have a stall on the load of b, you could actually run the instructions further down — load c, a × c, 7 × a × c — that don't depend on b.

Here's, pictorially, how this works. The processor is trying to execute the load of b; it hasn't completed yet, but the processor can make progress through operations that don't depend on its result. In this case it has already completed the loads of a and c, it has completed a × c, it is executing 7 × a × c, and so on. What processors actually do is: you pass them this sequential code, this series of instructions, and they dynamically recover the dataflow graph — they recreate this graph internally — and then run instructions as soon as they can.

Here's how it all looks when you put these techniques together. You have lots of different stages, with three or four instructions per stage. There's a first part of the processor that fetches instructions in order and decodes them in order, and then essentially reconstructs this dataflow graph, recording all the dependencies between instructions. Then we have an out-of-order execution phase, where each instruction executes on one of several functional units as soon as its source operands are available. And finally we write back the results into the register file or into memory in program order — there's a last stage that does all the writes in order.
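A toy model of the dataflow-order idea for D = 3 × (a − b) + 7 × a × c: each operation fires as soon as its inputs are ready, regardless of program order. The latencies are invented (the load of b is pretended to miss in the cache), and the model ignores issue width and functional-unit limits — it only illustrates why independent work can slide past a stalled load.

```python
ops = {                                       # name: (inputs, latency in cycles) -- all assumed
    "ld_a":   ([],                1),
    "ld_b":   ([],                5),         # pretend this load misses in the cache
    "ld_c":   ([],                1),
    "sub":    (["ld_a", "ld_b"],  1),         # a - b
    "mul3":   (["sub"],           1),         # 3 * (a - b)
    "mul_ac": (["ld_a", "ld_c"],  1),         # a * c
    "mul7":   (["mul_ac"],        1),         # 7 * a * c
    "add":    (["mul3", "mul7"],  1),
    "st_d":   (["add"],           1),
}

finish = {}
for name, (inputs, latency) in ops.items():   # listed in program order; inputs always come earlier
    start = max((finish[i] for i in inputs), default=0)
    finish[name] = start + latency

print(finish)  # mul_ac and mul7 finish long before sub, which is stuck waiting on the slow ld_b
```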
Can anyone tell me why we need to do all the writes in order? Yes? [inaudible student answer] Not quite, but it does have to do with speculation. Anyone else? [inaudible student answer]

By looking at the straight-line code it's hard to see, but what's happening here is that we're using the same speculation tricks we were using in our simpler pipelines: for every branch we are predicting what the next address is going to be. That means we might execute some work that we shouldn't have executed, and because work can run much more widely and out of order in these systems, some instructions might actually make it all the way through execute — you might have the results ready to be written back — and yet that work should be discarded. That's why there's a last phase that reorders everything. This is a subtle point. It's also the case that once you start looking at these complex pipelines and their interaction with other mechanisms, like TLBs and virtual memory, you basically have all the ingredients you need to understand the recent security attacks like Meltdown and Spectre, which essentially rely on this out-of-order execution feature: they're running work very far ahead, essentially misusing these execution resources to leak information.

Okay, so that concludes our discussion of the first three techniques. At a high level, you have this very long pipeline, and you have this technique that's trying to execute instructions based on their dependences rather than their order in the program. Now the issue we have remaining — and what I want to focus on for the rest of the lecture — is how to handle control-flow hazards. You end up with these fairly long pipelines, and in recent processors it takes anywhere from 10 to 15 cycles from when a branch is fetched until it is resolved — until it is executed and you actually know for sure whether the next PC was what you predicted or something else. This is what we call a loose loop: when there are a lot of pipeline stages between two events that create a dependence through the pipeline, it's a loose loop, because there's more than one pipeline stage between them, and that's going to impact performance.

So how much performance do we lose every time the pipeline doesn't follow the correct instruction flow? If we have a branch here and we mispredict the next PC, we send it down, and then, say, 15 cycles later we figure out that that's the wrong path and we need to redirect the pipeline — how much work have we lost? Yes? [Student: is it about the same work it takes to fill up the pipeline, because everything behind it is going to be invalid?] Exactly. If you were executing one instruction per stage, it would just be the number of stages, because that's how many instructions you could have in flight. In fact it's even worse, because these pipelines can execute multiple instructions per cycle, so in modern processors each branch misprediction will waste forty to fifty instructions. And the problem is that branches happen every five to twenty instructions. So suppose we're mispredicting 50% of the branches because we're just predicting PC + 4; then once every, say, ten instructions we're adding all this waste — we're throwing away 30 instructions. Is this good? What's the performance impact if once every ten instructions we take 30 instructions' worth of work that comes after, throw it out — turn it into no-ops — and restart execution? You have ten instructions' worth of useful work and thirty instructions' worth wasted, so the performance impact is 4×: you're wasting most of your execution resources because you're not following the right path.
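The arithmetic above, written out; the branch frequency, misprediction rate, and penalty are the lecture's rough numbers, treated here as assumptions.

```python
useful_between_mispredicts = 10   # ~1 misprediction every 10 instructions with PC+4 guessing
wasted_per_mispredict      = 30   # ~30 in-flight instructions thrown away each time

total_work = useful_between_mispredicts + wasted_per_mispredict
slowdown = total_work / useful_between_mispredicts
print(slowdown)                   # 4.0 -> roughly a 4x performance impact
```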
Yes? [inaudible student question] Absolutely — you are losing work because you're executing work that you end up discarding. If you were able to make better guesses, you wouldn't be losing that work. Imagine that in the best case you had some magical device that always told you what the next PC was; then you'd be able to do much better. That's why I say we're losing work, and in fact what we're going to see in the rest of the lecture is that you can get very close to that: you can predict 99 to 99.5 percent of branches correctly, and modern processors have roughly one misprediction every thousand cycles or so.

So again, the name of the game here is to improve the accuracy of our guesses. Let's go back to RISC-V branches and jumps. Remember that control-flow hazards happen because each instruction fetch depends on information from the preceding instruction, or from some number of preceding instructions. We want to know two things. First: is this a branch or a jump — is this something that alters control flow, or is it going to go to PC + 4? And second: if it's a taken branch or a jump, what is the target, and what is the earliest point at which we can know the target? For different instruction types, you'll know these things at different points in the pipeline. As I mentioned a few lectures ago, this is specific to your pipeline, but let's go through the exercise of thinking about the earliest point at which we can know both whether the instruction is taken and, if it's taken, what the target is.

For a jump and link (JAL): it computes the next PC using the immediate and the current PC. When do you know it's taken — what's the earliest you can know? Yes? [Student: decode] Yep. Once you fetch the instruction, you can take a look and see: ah, this is a jump and link, that's the opcode, so I know I'm going to have to redirect the PC. When do you know the target — when do you know what the next PC is? [Student: execute?] Almost — if this were a jump and link register, then yes, but this is a jump and link, so the next PC comes from the PC plus the immediate. [Student: decode?] Yes: as early as decode, you can stick an adder in there and compute the next PC. So jump and link is sort of the best case.

For a jump and link register (JALR): taken — same thing. We can look at the opcode and know it's a jump, so we're going to change the PC for sure, after decode. But when is the target known? After execute, because you need to at least read the register and do some computation with it.
Now, you could have a pipeline with a separate register-read stage, stick another adder in there, and if the timing fits you could know the target a little bit earlier. How about branches — when do you know a branch is taken? Yes? [Student: after execute] Exactly, because we need to read the registers, compare them, and figure out whether the branch condition is met or not, and that takes some time. When is the target known, though? The target for a branch is exactly the same computation as the target for a jump and link, so you actually know the target in decode.

Okay, so armed with all of this, let's start seeing how we can improve performance with better branch prediction. As I said, the name of the game here is to make good guesses. We've been guessing PC + 4 as the value of the next PC, but we're going to discuss a few strategies that let us predict both the target and the branch condition much better.

One thing you can do — and something that some ISAs do — is what's called static branch prediction. This is very empirical: you take your application and look at how many branches there are and how many branches are taken. You'll see that in most applications 60 to 70% of branches are actually taken, so if you're predicting PC + 4 all the time, you'll be mispredicting around 70% of the branches. You can do some analysis and separate two cases: branches that jump backwards, to some earlier, smaller PC than that of the branch instruction, and branches that jump forward, to a later PC — which, of course, you can distinguish just by looking at the offset; you don't need to compute the condition. As it turns out, backward branches are taken over 90% of the time, and forward branches are taken about half the time. Can you guess why? Yes? [Student answer] Right: backward branches are loops — you're always going back to the beginning of the loop, so they're much more frequently taken — while forward branches tend to be if-else conditions. So you can build a little bit of logic that, as early as decode, says: this is a backward branch, I'm going to predict that it's taken, and I'm going to redirect control flow earlier. Some ISAs actually add hints to instructions to tell the processor, "look, this is a branch-if-not-equal, but assume it's going to be taken" (or assume it's not going to be taken). By adding these hints, after the proper compiler analysis and so on, we can achieve about 80% accuracy in branch prediction, which is not bad. We have two problems, though. First, we have RISC-V, so we cannot change the ISA. And second, it's actually much easier than that: you can do much better with very simple hardware, without resorting to these ISA changes.
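A sketch of the backward-taken / forward-not-taken heuristic just described, assuming the branch's byte offset is visible at decode; this is an illustration of the idea, not code from the course.

```python
def static_predict_next_pc(pc, offset):
    """Guess the next PC for a conditional branch whose target is pc + offset."""
    if offset < 0:
        return pc + offset   # backward branch: usually a loop, so guess taken
    return pc + 4            # forward branch: often an if/else, so guess not taken

print(hex(static_predict_next_pc(0x1040, -0x20)))  # backward -> predict taken (0x1020)
print(hex(static_predict_next_pc(0x1040,  0x30)))  # forward  -> predict not taken (0x1044)
```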
So what most processors do today is what we call dynamic branch prediction, where essentially we have some device that, given a PC, gives us a very fast prediction of the next PC. Then, once we know what the right next PC actually is, if the predictor was wrong we give it some feedback to retrain it. This is a feedback-driven predictor: when it does something wrong, we update it so that it learns over time about the behavior of the program.

This works because of two things. The first is what we call temporal correlation: it's often the case that the way a branch resolves — taken or not taken — is the same over time. For example, in a loop the branch is going to be taken most of the time. Another example: you have some assertion in your code that's checking whether some condition is met, but it fails very rarely — you have a branch, but that branch is basically never taken, so it's very easy for simple hardware to predict not taken, not taken, not taken. The other kind of information modern processors leverage, which we're not going to get to see, is what we call spatial correlation: in many pieces of code, branches resolve in the same, correlated way, because there's a small number of paths of execution through the code. But that's more advanced, and we won't see it today.

There are many different branch prediction strategies. I'm going to focus on a very simple one that's actually very useful for the design project: the branch target buffer, or BTB for short. The branch target buffer is a cache for branch targets. Here's how it works: at the beginning of fetch, we take some bits of the PC and index into this table, which is basically a cache — we have a tag, we have the target PC (the predicted next PC for this PC), and we have a valid bit, just like a cache. But instead of being a cache for data, this is a cache of predictions: it's telling us, for this PC, what the next PC should be. The way we use it is: if there's a hit, we take the target and use it as the next PC; if there isn't a hit, or we hit on an invalid line, we just use PC + 4. The key idea is that for taken branches and for jumps, we're going to have an entry that tells us what the next PC should be. We train the BTB so that whenever we see a taken branch or a jump, we store the (PC, target PC) pair in this little cache, and the next time we run through it we'll make the right prediction, assuming the branch is still taken.

In a little more detail, here's how it looks. We want to predict the next PC immediately, so we add this BTB: we feed it the PC and we immediately get the prediction for the next PC value. So instead of a loose loop, this is what we call a tight loop — it all happens in the same cycle. Then, when we find out what the actual result is, if there's a misprediction we still need to go update the BTB and redirect the PC, but the key difference is that we're going to take that penalty much less frequently if we're making good predictions.
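A minimal software sketch of the BTB just described — a direct-mapped table of (tag, target) entries, indexed by low PC bits, with a predict path used at fetch and a train-on-taken path. The 64-entry size and full tags are assumptions for illustration, not the design project's required structure.

```python
class BTB:
    """Direct-mapped branch target buffer: a cache of next-PC predictions."""
    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries          # each slot: (tag, target) or None (invalid)

    def _index_and_tag(self, pc):
        word = pc >> 2                         # drop the 2 byte-offset bits
        return word % self.entries, word // self.entries

    def predict(self, pc):
        """Fetch stage: on a hit return the stored target, otherwise PC + 4."""
        idx, tag = self._index_and_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return pc + 4

    def train_taken(self, pc, target):
        """Record a taken branch or jump so the next lookup predicts its target."""
        idx, tag = self._index_and_tag(pc)
        self.table[idx] = (tag, target)
```

For example, after `train_taken(0x1000, 0x0800)`, a later `predict(0x1000)` returns `0x0800` instead of `0x1004`.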
Okay, so some implementation details. Here's the same picture as before — there's also an instruction memory here, so we take the whole PC and use it to index into the instruction memory. One key difference between BTBs and caches is that a BTB is a cache for predictions. If I give you a wrong value from a cache — because I'm not storing the tags, say, or I'm not storing all the bits of the data and I'm making up some bits — then your execution will simply be wrong. But in a BTB, what happens if I give you the wrong value — some target PC that is not, in fact, the right target PC? [Student answer] Yeah: there's some penalty, but it's still correct. So what you have here is an accuracy trade-off rather than a correctness trade-off, which isn't really a trade-off at all. The point is that there are some very easy ways to make a very small and efficient BTB, and I mention this specifically because if you want a good clock cycle time, you might want to make your BTB small by shaving off some bits.

Because it's just a prediction, one thing we can do is not store all the bits of the PC as the tag. Say you have 64 entries, so you use six bits of the PC as the index, and there are two offset bits you're not going to use; that leaves 24 remaining bits, so a full tag would need 24 bits. But just store four or five bits of the tag — take the next four or five bits — and if you have a conflict, well, you'll return the wrong PC. Still fine: there will be some mispredictions, but you're shaving off a whole lot of bits by storing only part of the PC. Yes? [Inaudible student question] Right, exactly: you can have two jumps that fall on the same index and have the same four or five tag bits that you're storing, so you actually have a conflict there — but that's okay; in that case you'll just have a misprediction. [Another question] This actually doesn't assume much about the style of assembly you have. There are more sophisticated predictors that make deeper assumptions, but this is fairly general: you store the target PC; if it's there, you use it; if it's not, you don't. For taken branches and jumps, this tends to work well.

There are other ways of saving state, too: you don't even need to store all the bits of the target PC. Instead of 32 bits you can store, say, twelve bits of the target PC and then fill in the rest of the bits from the current PC. Again, you'll have some mispredictions if you're jumping far away, but most jumps and branches are fairly close by. Same deal: you'll get a bit less prediction accuracy, but much less overhead.
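A sketch of the two bit-shaving tricks just mentioned — partial tags and compressed targets — for a hypothetical 64-entry BTB with 5-bit tags and 12 stored target bits. The exact field widths are illustrative assumptions.

```python
ENTRIES = 64                         # 6 index bits

def index_and_partial_tag(pc):
    idx  = (pc >> 2) & 0x3F          # skip 2 offset bits, take 6 index bits
    tag5 = (pc >> 8) & 0x1F          # keep only the next 5 bits instead of a full 24-bit tag
    return idx, tag5                 # two PCs can alias here; the cost is only a misprediction

def compress_target(target_pc):
    return target_pc & 0xFFF         # store only the low 12 bits of the target

def expand_target(current_pc, stored12):
    # fill in the missing high bits from the current PC; wrong for far-away targets,
    # but that again only costs accuracy, never correctness
    return (current_pc & ~0xFFF) | stored12
```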
Yes? [Student question about shrinking the tags and the stored target PC] So suppose you have cycle-time problems in your fetch stage, because by adding this BTB you now need to index into this table. How large you can make the table affects accuracy, but it also determines how many picoseconds it takes from when you index into it until you know whether you have a match and get the target PC out. The trade-off is that if you reduce the number of bits per entry, you can typically have more entries without increasing the clock cycle time, and that is good for accuracy. Basically, these are implementation tricks that may or may not help you on the design project, depending on how high a frequency you're targeting.

Yes? [Inaudible question] Yes — and that's because at that point you don't even know that it's a branch instruction, so you're accessing the BTB for every PC. This is also why I'm insisting that it should be small, not some huge memory. The other thing is that you're going to have to implement this with registers, because you want a combinational read: on the same cycle, you want to get the data out as quickly as possible, so typically you don't use a memory array for this — you use registers. All right. Another thing you can do is not even store the valid bit, although we'll see in a couple of slides why the valid bit might be good. Just as a hint: even small BTBs can get you a lot of performance improvement.

In a little more detail, this would be the interface for the BTB: you give it the current PC, and there's a predict function that does a lookup to predict the next PC in the fetch stage. Then there's an update method that, on a misprediction, is called from the execute stage: when you have a branch that is taken and you know the right next PC, execute calls update to train the predictor and insert the right (PC, next PC) pair. There's another subtle detail here: sometimes you might be predicting a branch taken — you have a valid PC-to-next-PC translation — but it turns out that the branch is not taken, and then you need to retrain the BTB the other way: you need to delete the entry, mark it invalid, so that next time you do PC + 4, i.e., you predict that the branch is not taken.

This is a good way to improve performance in parts two and three of the design project. In fact, for part two, if you really work hard, you can do all the redirection and next-PC computation in the decode stage — but that's only if you really want to push on reducing clock cycle time; you can do it within the clock-period target that part two requires. If you don't go that far, it's actually really easy to use this BTB in the fetch stage and get good prediction accuracy.
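One possible shape for the predict/update interface just described, with update called from execute once the branch resolves: insert the pair on a taken branch, and invalidate the entry when a predicted-taken branch turns out not taken. The dict-based storage and function names are assumptions, not the design project's required API.

```python
btb = {}   # pc -> predicted next pc; an entry being present plays the role of the valid bit

def predict(pc):
    """Fetch stage: stored target on a hit, PC + 4 otherwise."""
    return btb.get(pc, pc + 4)

def update(pc, taken, target):
    """Execute stage, called on a misprediction: retrain toward the actual outcome."""
    if taken:
        btb[pc] = target        # remember the taken branch/jump and its target
    else:
        btb.pop(pc, None)       # predicted taken but wasn't: invalidate so we go back to PC + 4

# tiny usage example: a branch at 0x100 that is taken twice, then falls through
for actual_taken in (True, True, False):
    guess = predict(0x100)
    actual_next = 0x80 if actual_taken else 0x104
    if guess != actual_next:                      # misprediction detected in execute
        update(0x100, actual_taken, 0x80)
```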
Okay, so this brings us to the last part of the lecture. Our BTB is doing a decent job at storing targets — and this is in fact the main way that modern processors decide where to jump — but there are also better things we can do to predict whether a branch is taken or not. Consider this very simple loop: we have some instructions, then we decrement a counter, and when that counter is zero we fall through; if it's not zero, we jump back. How many mispredictions are we going to have per loop with the BTB?

Let's go through this. Suppose you're in the middle of the loop and you've already taken the branch a bunch of times. The branch is taken, and taken, and taken, and the BTB stores the right next PC for it because we've trained it, so it already has the right prediction. In steady state, we're predicting taken, taken, taken, 100% right. But now we get to the last iteration, and this time the branch is not taken. On the last iteration we mispredict, because the BTB is saying, "no, this branch is taken, you should do one more iteration of the loop," and it turns out that we fall through. So when we exit the loop, we're going to have one misprediction. But what happens when we go into the loop the next time? Yes? [Student answer] Exactly: at that point we've trained the BTB to say, "no, this is not taken" — it stopped being taken — and so we take yet another misprediction when we start the loop.

So what can you do to make this better? One very simple optimization is, instead of having just a valid bit, to have two bits that we treat as a saturating counter, to give some hysteresis to the process. Suppose we have two bits per BTB entry, a counter going from zero to three: we increase the counter when the branch is taken and decrease it when it's not taken. When the counter is 0 or 1, we predict not taken; when it's 2 or 3, we predict taken. So what happens in our loop now? We're predicting taken, taken, taken on each iteration, so the branch is biased to strongly taken. We hit the last iteration of the loop, it's not taken, and we go down to weakly taken — but we still predict taken, because it's been taken so many times before, and that's the right thing to do: it's more likely for this branch to be taken in the future. So when we hit the first iteration of the loop again, we see that it is taken and go back to strongly taken. So how many mispredictions per loop now? We've gone from two down to one.
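A minimal sketch of the 2-bit saturating counter (0–3, predict taken when the counter is 2 or more) on the loop branch just discussed — taken on every iteration except the last. The 10-iteration loop length is an arbitrary assumption.

```python
def run_loop(counter, iterations=10):
    """Simulate one execution of the loop branch: taken (iterations-1) times, then not taken."""
    mispredicts = 0
    outcomes = [True] * (iterations - 1) + [False]
    for taken in outcomes:
        if (counter >= 2) != taken:                          # predict taken when counter is 2 or 3
            mispredicts += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return counter, mispredicts

counter = 3                         # already trained "strongly taken" by earlier runs
for _ in range(2):                  # run the whole loop twice in a row
    counter, m = run_loop(counter)
    print(m)                        # 1 misprediction per loop execution (only the loop exit)
```

With only a valid bit, the same trace would mispredict twice per loop execution: once at the exit and once again on re-entry.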
At a very high level, modern processors actually have many of these more specialized predictors. They all have a BTB, but then they have specialized predictors for branch direction and for return-address prediction — for example, when you have function calls, the processor remembers the PC you're expected to return to, and on a return instruction it feeds that PC in as the next PC — plus loop predictors and so on. There are lots of predictors specialized to particular program behaviors, and along the pipeline, as soon as the processor gets more information, if it detects — or believes — that it has gone down the wrong path, it tries to redirect control flow to reduce the misprediction penalty.

Okay, so that's it for today. Modern processors basically rely on a handful of techniques: deeper and wider pipelines to get good ideal CPI, and then out-of-order execution and branch prediction to reduce the impact of hazards. These techniques have been very good at delivering high sequential performance — you write a serial program and it runs really quickly — but they unfortunately ran out of steam in the mid-2000s, and since then we've moved to multicore processors that integrate multiple of these processing cores. That's what we will discuss next week. The next lecture will still stay on single-core, high-performance processors, and we'll see how to do bypassing in Bluespec and other tips for the design project. All right, thank you.