The goal of this lecture is to first give you a very broad, high-level introduction to how modern processors work, so that when you write programs and try to optimize their performance, you understand at a high level what mechanisms are going on and how the processor is trying to help you improve performance. Then, in the second part of the lecture, we're going to focus on branch prediction, and here we'll discuss a technique that is going to be quite useful for your design project.

So, just as a reminder, this is the iron law of processor performance, where we express performance as the product of three different terms: the instructions that the program takes, the cycles per instruction (CPI) that the processor takes to execute the instructions in the program, and the time per cycle, or clock cycle time. You can improve any of these three, and that will improve the performance of your system. We're going to focus, as before, on the last two terms: CPI and clock cycle time. Remember that when we introduced pipelining, we said pipelining is a good technique to reduce clock cycle time and therefore improve clock frequency; that's the reason you have processors today that can run at 3 or 4 gigahertz — they have 10, 20, 30 pipeline stages. But the other component here is CPI, the number of cycles per instruction that our processor takes on average in a particular program.

Let's be a little more systematic about how we try to improve CPI by first dividing it into two terms. I'm going to say that the cycles per instruction can be divided into two components. The first is what we call CPI ideal: the cycles per instruction that your machine would achieve if there were no stalls and no mispredictions — if you could push the maximum number of instructions through the pipeline every cycle. The second component we're going to call CPI hazard: the contribution of data and control hazards to the cycles per instruction. You can see hazards — whether from memory accesses or from operations that have dependencies on each other's values — as adding cycles to each instruction you're executing, and on average this contributes to the CPI. Control hazards count too: branches, jumps, and also exceptions — remember, an exception is essentially an implicit branch on every instruction — add extra cycles that we need to handle.

Okay, so we started by looking at the classic five-stage pipeline, with fetch, decode, execute, memory access, and write back. This is a very simple design — you're basically starting from something very similar, even a little simpler, in the design project — and it is still used when you don't need super high performance. Many chips used in embedded devices and in things like network routers and Wi-Fi access points will have, for example, low-power ARM processors that implement precisely this pipeline.
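A quick numeric sketch of the iron law and the CPI split described above; every number below is an illustrative assumption, not a figure from the lecture.

```python
# Iron law: time/program = (instructions/program) x (cycles/instruction) x (time/cycle),
# with CPI further split into an ideal component and a hazard component.
instructions = 1_000_000      # assumed instruction count for some program
cpi_ideal    = 1.0            # single-issue pipeline: at best one instruction per cycle
cpi_hazard   = 0.5            # assumed extra cycles per instruction from data/control hazards
cycle_ns     = 0.5            # assumed 2 GHz clock

cpi = cpi_ideal + cpi_hazard
time_ms = instructions * cpi * cycle_ns / 1e6
print(cpi, time_ms)           # improving any factor of the product improves execution time
```

Shrinking any one of the three factors shrinks the product, which is the point the lecture develops next: pipelining attacks clock cycle time, and the remaining techniques attack CPI.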
But this pipeline has some problems. Even though the ideal CPI is one — well, one is the upper bound on performance here, and we can actually do better, as we'll discuss later — when you have a high-latency instruction, for example a cache miss in the memory stage, the problem is that everything behind the instruction that causes the stall also stalls. A single stall propagates immediately through the pipeline. And of course you have long operations: for example, you have a multiplier that you'd like to pipeline and that takes three, four, or five cycles, and now you need to dramatically increase the number of stages in your design. Just five stages often doesn't give you enough time to do complex operations.

So there are essentially four things we can do to improve the performance of this pipeline. Two are conceptually fairly simple. We can keep pipelining — keep dividing our pipeline into more and more stages to increase the clock frequency. Another conceptually simple, though mechanically quite complicated, optimization is what we call wider pipelines, where instead of executing one instruction per cycle we try to fetch, decode, execute, and so on, multiple instructions every cycle. The problem is that if we just apply these two techniques, performance is not going to be good: yes, we're reducing the clock cycle time and getting great frequency, and we're reducing the ideal CPI, but we still have to deal with hazards. So there are two more techniques used to deal with hazards. Data hazards are typically handled by what we call out-of-order execution, where instead of executing instructions in the order they appear in the program, the processor tries to execute instructions as soon as their source operands are available. The second technique, which we will see in more detail, is reducing the impact of control hazards through what we call branch prediction — really, prediction of branches and jumps — where we try to make better predictions than just PC + 4 when figuring out the next PC that the pipeline should fetch.

I'll explain the first three techniques fairly quickly, and this is purely for demystification purposes — it's not going to be on the quiz. It's just so that when you have a program running on a modern processor and you see that the code has a certain data dependence, yet the processor is somehow overlapping things in ways you don't expect, you understand that it's because of some of these techniques.

Okay, so deeper pipelines. The key idea here is: if five stages were good, then ten must be better, and fifteen better still. Instead of having five stages, we keep breaking up the work into more and more stages. Instead of a single cycle for fetching, we can take two cycles, which allows us to clock the memory faster or to use larger memories; we can break decoding into decode and register read; and so on. This is good because, to a large degree, it's the main weapon that modern processors use to run at multiple gigahertz.
For example, designs in the early 2000s had around eleven or twelve stages. In the mid-2000s there was a period when the main way to sell performance was frequency, so even if processors didn't do much work per clock cycle, processor designers got into a race over how deeply they could pipeline, so they could sell you processors that run at four gigahertz instead of three — because people want a single number to summarize performance. That's why you have designs like the Pentium 4, with 34 pipeline stages, which is a little on the high end. These days you'll see designs with sixteen to twenty pipeline stages.

There are some costs, of course. The obvious one is that you have more pipeline registers, but you also have more bypass paths. Remember that if you have sequences of instructions with dependences, you want to be able to bypass a source operand that might still be further down in these stages, right before it is consumed. In this case, at the output of the read-register stage you'd need a bypass path from here, from here, from here, and also from here — you need to add more and more bypass paths. Remember, muxes are expensive and wires are also expensive, so this doesn't scale very well.

All in all, the more you overlap, the more the dependencies among nearby instructions are going to hurt you in terms of cycles. Things that used to be cheap get more expensive: if before every mispredicted branch cost us a penalty of two cycles, or two instructions, now our cycles might be very fast, but if we're resolving branches in stage five we're going to incur a penalty of five instructions — we throw away five instructions on every misprediction. There's also the issue that at some point clocking overhead, things like the setup time of the registers, becomes dominant: if you're spending more time on register setup than on the actual logic of each pipeline stage, you don't get a lot of performance benefit. And finally, very high clock frequencies increase power consumption by a lot, so this also runs into limitations. Okay, any questions? All right.
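A rough back-of-the-envelope model of the depth trade-off just described — splitting a fixed amount of logic across more stages while paying a per-stage register/clocking overhead, and resolving branches deeper in the pipeline. All of the numbers are made-up assumptions for illustration.

```python
LOGIC_DELAY_PS = 2000    # assumed total combinational delay for one instruction's work
OVERHEAD_PS    = 100     # assumed per-stage register setup / clocking overhead

for stages in (5, 10, 20, 40):
    period_ps = LOGIC_DELAY_PS / stages + OVERHEAD_PS
    freq_ghz = 1000.0 / period_ps
    resolve_stage = int(stages * 0.75)   # assume branches resolve ~3/4 of the way down
    print(stages, round(freq_ghz, 2), resolve_stage)
# frequency gains flatten out as the per-stage overhead starts to dominate,
# while the number of instructions thrown away per misprediction keeps growing
```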
The other broad technique is, instead of pushing for a very high clock frequency, to have a wider pipeline, where every stage processes multiple instructions per cycle. For example, in the pipeline shown here I'm showing a width of four: each pipeline register has four elements, which represent four different instructions. The idea is that you have a very wide instruction cache or instruction memory, and instead of giving you 32 bits it gives you 128 bits — four instructions, at PC, PC + 4, PC + 8, and PC + 12 — so you fetch a chunk of four instructions, then decode them in parallel, read their registers in parallel, execute them in parallel (you'll need four ALUs), and even when you need to access memory you can do those memory accesses in parallel, and so on.

This addresses the CPI-ideal component I mentioned: if we have a pipeline of width W, we get a CPI ideal of 1/W, because in the best case each instruction takes 1/W cycles — in other words, every cycle we're executing W instructions. Modern processors will routinely try to fetch, decode, and execute three to four instructions per cycle. So if you're writing high-performance code, it helps to write it in a way that avoids very tight dependences, because if you have a dependence between two instructions in the same cycle, at some point the processor is not going to be able to push all those dependent instructions through the pipeline in the same cycle — you want some measure of independent work.

The problem with this technique is that there are a lot of quadratic overheads: you need a wider instruction cache, more ALUs, a register file that in this case has eight read ports instead of two, and that adds a lot of wiring. And because you're trying to execute multiple instructions per cycle, the impact of hazards grows: you're going to have many cases where dependent instructions cannot progress through the pipeline in the same cycle, and you might need to hold off some of the instructions that were fetched or decoded for a few cycles.

This is something you could actually try to do in the design project. It's kind of complicated, but a superscalar pipeline that tries to run a couple of instructions per cycle could be doable. There are simpler techniques, though, and you absolutely do not need to do this to get full points.
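A toy illustration of the point just made about CPI ideal and dependences: a W-wide machine can finish W independent instructions per cycle, but a chain of fully dependent instructions still proceeds one per cycle. This is only a counting model, not how a real issue queue works.

```python
import math

def best_case_cycles(n_instructions, width, fully_dependent):
    if fully_dependent:
        return n_instructions                  # each instruction waits on the previous result
    return math.ceil(n_instructions / width)   # otherwise up to W instructions per cycle

print(best_case_cycles(100, 4, fully_dependent=False))  # 25 cycles -> CPI of 0.25 = 1/W
print(best_case_cycles(100, 4, fully_dependent=True))   # 100 cycles -> width doesn't help
```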
So this brings us to the issue of hazards, and specifically data hazards. We have these three strategies — stall, bypass, speculate. We use stalling and bypassing for data hazards, and either stalling or speculation for control hazards; we said control hazards allow speculation because it's easy to make good guesses, and we'll see in the last part of the lecture that it's actually easy to make really good guesses. But right now I want to focus on data hazards and introduce a fourth strategy, which is essentially to find something else to do. If you have an instruction that depends on some data value that's not yet available — for example, there's a load, and then another instruction that consumes that load — well, you could stall; predicting the value doesn't really work, and bypassing doesn't really work because you're still fetching it from memory. But what if there's some other instruction downstream — not the next instruction in program order, but some instruction later on — that's actually independent, operating on independent data? The way to exploit that is through what we call out-of-order execution. It's not a very good marketing term — you won't see lots of processors advertising "we'll sell you an out-of-order processor"; I think Intel calls this dynamic execution or some other thing — but the technical term is out-of-order execution, because essentially we're trying to execute instructions in what we call dataflow order: as soon as their input operands are available.

To see how this works, consider this expression: we want to compute D = 3 × (a − b) + 7 × a × c. All of a, b, and c are in memory, so you're going to have to load them, then do some computation to produce D, and finally store D to some location in memory. A compiler could produce the sequential code — the sequence of instructions — to compute D: it loads a, then loads b, then subtracts b from a, then multiplies by three to get 3 × (a − b), so it has this part of the expression; then it loads c, because it needs c to produce the second part, multiplies a times c, computes with another multiply 7 × a × c, and finally adds everything up and stores the result. But not every instruction here depends on the results of the previous instructions. So instead we can represent this computation as what we call a dataflow graph. In a dataflow graph, nodes (vertices) represent computations or operations, and edges between the vertices represent the input and output data flowing between these operations. Here you can see that we can load b, a, and c, and then the subtract operation takes b and a as inputs, and so on. By expressing things as this dataflow graph, you can see that not everything is dependent: in fact, if you have a stall on the load of b, you could actually run the instructions further down — load c, a × c, 7 × a × c — that don't depend on b.

Here's, pictorially, how this works. The processor is trying to execute the load of b; it hasn't completed yet, but the processor can make progress through operations that don't depend on its result. In this case it has already completed the loads of a and c, it has completed a × c, it is executing 7 × a × c, and so on. What processors actually do is: you pass them this sequential code, this series of instructions, and they dynamically recover the dataflow graph — they recreate this graph internally — and then run instructions as soon as they can.

Here's how it all looks when you put these techniques together. You have lots of different stages, with three or four instructions per stage. There's a first part of the processor that fetches instructions in order and decodes them in order, and then essentially reconstructs this dataflow graph, recording all the dependencies between instructions. Then we have an out-of-order execution phase, where each instruction executes on one of several functional units as soon as its source operands are available. And finally we write back the results into the register file or into memory in program order — there's a last stage that does all the writes in order.
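A toy model of the dataflow-order idea for D = 3 × (a − b) + 7 × a × c: each operation fires as soon as its inputs are ready, regardless of program order. The latencies are invented (the load of b is pretended to miss in the cache), and the model ignores issue width and functional-unit limits — it only illustrates why independent work can slide past a stalled load.

```python
ops = {                                       # name: (inputs, latency in cycles) -- all assumed
    "ld_a":   ([],                1),
    "ld_b":   ([],                5),         # pretend this load misses in the cache
    "ld_c":   ([],                1),
    "sub":    (["ld_a", "ld_b"],  1),         # a - b
    "mul3":   (["sub"],           1),         # 3 * (a - b)
    "mul_ac": (["ld_a", "ld_c"],  1),         # a * c
    "mul7":   (["mul_ac"],        1),         # 7 * a * c
    "add":    (["mul3", "mul7"],  1),
    "st_d":   (["add"],           1),
}

finish = {}
for name, (inputs, latency) in ops.items():   # listed in program order; inputs always come earlier
    start = max((finish[i] for i in inputs), default=0)
    finish[name] = start + latency

print(finish)  # mul_ac and mul7 finish long before sub, which is stuck waiting on the slow ld_b
```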
Can anyone tell me why we need to do all the writes in order? Yes? [inaudible student answer] Not quite, but it does have to do with speculation. Anyone else? [inaudible student answer]

By looking at the straight-line code it's hard to see, but what's happening here is that we're using the same speculation tricks we were using in our simpler pipelines: for every branch we are predicting what the next address is going to be. That means we might execute some work that we shouldn't have executed, and because work can run much more widely and out of order in these systems, some instructions might actually make it all the way through execute — you might have the results ready to be written back — and yet that work should be discarded. That's why there's a last phase that reorders everything. This is a subtle point. It's also the case that once you start looking at these complex pipelines and their interaction with other mechanisms, like TLBs and virtual memory, you basically have all the ingredients you need to understand the recent security attacks like Meltdown and Spectre, which essentially rely on this out-of-order execution feature: they're running work very far ahead, essentially misusing these execution resources to leak information.

Okay, so that concludes our discussion of the first three techniques. At a high level, you have this very long pipeline, and you have this technique that's trying to execute instructions based on their dependences rather than their order in the program. Now the issue we have remaining — and what I want to focus on for the rest of the lecture — is how to handle control-flow hazards. You end up with these fairly long pipelines, and in recent processors it takes anywhere from 10 to 15 cycles from when a branch is fetched until it is resolved — until it is executed and you actually know for sure whether the next PC was what you predicted or something else. This is what we call a loose loop: when there are a lot of pipeline stages between two events that create a dependence through the pipeline, it's a loose loop, because there's more than one pipeline stage between them, and that's going to impact performance.

So how much performance do we lose every time the pipeline doesn't follow the correct instruction flow? If we have a branch here and we mispredict the next PC, we send it down, and then, say, 15 cycles later we figure out that that's the wrong path and we need to redirect the pipeline — how much work have we lost? Yes? [Student: is it about the same work it takes to fill up the pipeline, because everything behind it is going to be invalid?] Exactly. If you were executing one instruction per stage, it would just be the number of stages, because that's how many instructions you could have in flight. In fact it's even worse, because these pipelines can execute multiple instructions per cycle, so in modern processors each branch misprediction will waste forty to fifty instructions. And the problem is that branches happen every five to twenty instructions. So suppose we're mispredicting 50% of the branches because we're just predicting PC + 4; then once every, say, ten instructions we're adding all this waste — we're throwing away 30 instructions. Is this good? What's the performance impact if once every ten instructions we take 30 instructions' worth of work that comes after, throw it out — turn it into no-ops — and restart execution? You have ten instructions' worth of useful work and thirty instructions' worth wasted, so the performance impact is 4×: you're wasting most of your execution resources because you're not following the right path.
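The arithmetic above, written out; the branch frequency, misprediction rate, and penalty are the lecture's rough numbers, treated here as assumptions.

```python
useful_between_mispredicts = 10   # ~1 misprediction every 10 instructions with PC+4 guessing
wasted_per_mispredict      = 30   # ~30 in-flight instructions thrown away each time

total_work = useful_between_mispredicts + wasted_per_mispredict
slowdown = total_work / useful_between_mispredicts
print(slowdown)                   # 4.0 -> roughly a 4x performance impact
```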
Yes? [inaudible student question] Absolutely — you are losing work because you're executing work that you end up discarding. If you were able to make better guesses, you wouldn't be losing that work. Imagine that in the best case you had some magical device that always told you what the next PC was; then you'd be able to do much better. That's why I say we're losing work, and in fact what we're going to see in the rest of the lecture is that you can get very close to that: you can predict 99 to 99.5 percent of branches correctly, and modern processors have roughly one misprediction every thousand cycles or so.

So again, the name of the game here is to improve the accuracy of our guesses. Let's go back to RISC-V branches and jumps. Remember that control-flow hazards happen because each instruction fetch depends on information from the preceding instruction, or from some number of preceding instructions. We want to know two things. First: is this a branch or a jump — is this something that alters control flow, or is it going to go to PC + 4? And second: if it's a taken branch or a jump, what is the target, and what is the earliest point at which we can know the target? For different instruction types, you'll know these things at different points in the pipeline. As I mentioned a few lectures ago, this is specific to your pipeline, but let's go through the exercise of thinking about the earliest point at which we can know both whether the instruction is taken and, if it's taken, what the target is.

For a jump and link (JAL): it computes the next PC using the immediate and the current PC. When do you know it's taken — what's the earliest you can know? Yes? [Student: decode] Yep. Once you fetch the instruction, you can take a look and see: ah, this is a jump and link, that's the opcode, so I know I'm going to have to redirect the PC. When do you know the target — when do you know what the next PC is? [Student: execute?] Almost — if this were a jump and link register, then yes, but this is a jump and link, so the next PC comes from the PC plus the immediate. [Student: decode?] Yes: as early as decode, you can stick an adder in there and compute the next PC. So jump and link is sort of the best case.

For a jump and link register (JALR): taken — same thing. We can look at the opcode and know it's a jump, so we're going to change the PC for sure, after decode. But when is the target known? After execute, because you need to at least read the register and do some computation with it.
Now, you could have a pipeline with a separate register-read stage, stick another adder in there, and if the timing fits you could know the target a little bit earlier. How about branches — when do you know a branch is taken? Yes? [Student: after execute] Exactly, because we need to read the registers, compare them, and figure out whether the branch condition is met or not, and that takes some time. When is the target known, though? The target for a branch is exactly the same computation as the target for a jump and link, so you actually know the target in decode.

Okay, so armed with all of this, let's start seeing how we can improve performance with better branch prediction. As I said, the name of the game here is to make good guesses. We've been guessing PC + 4 as the value of the next PC, but we're going to discuss a few strategies that let us predict both the target and the branch condition much better.

One thing you can do — and something that some ISAs do — is what's called static branch prediction. This is very empirical: you take your application and look at how many branches there are and how many branches are taken. You'll see that in most applications 60 to 70% of branches are actually taken, so if you're predicting PC + 4 all the time, you'll be mispredicting around 70% of the branches. You can do some analysis and separate two cases: branches that jump backwards, to some earlier, smaller PC than that of the branch instruction, and branches that jump forward, to a later PC — which, of course, you can distinguish just by looking at the offset; you don't need to compute the condition. As it turns out, backward branches are taken over 90% of the time, and forward branches are taken about half the time. Can you guess why? Yes? [Student answer] Right: backward branches are loops — you're always going back to the beginning of the loop, so they're much more frequently taken — while forward branches tend to be if-else conditions. So you can build a little bit of logic that, as early as decode, says: this is a backward branch, I'm going to predict that it's taken, and I'm going to redirect control flow earlier. Some ISAs actually add hints to instructions to tell the processor, "look, this is a branch-if-not-equal, but assume it's going to be taken" (or assume it's not going to be taken). By adding these hints, after the proper compiler analysis and so on, we can achieve about 80% accuracy in branch prediction, which is not bad. We have two problems, though. First, we have RISC-V, so we cannot change the ISA. And second, it's actually much easier than that: you can do much better with very simple hardware, without resorting to these ISA changes.
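A sketch of the backward-taken / forward-not-taken heuristic just described, assuming the branch's byte offset is visible at decode; this is an illustration of the idea, not code from the course.

```python
def static_predict_next_pc(pc, offset):
    """Guess the next PC for a conditional branch whose target is pc + offset."""
    if offset < 0:
        return pc + offset   # backward branch: usually a loop, so guess taken
    return pc + 4            # forward branch: often an if/else, so guess not taken

print(hex(static_predict_next_pc(0x1040, -0x20)))  # backward -> predict taken (0x1020)
print(hex(static_predict_next_pc(0x1040,  0x30)))  # forward  -> predict not taken (0x1044)
```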
So what most processors do today is what we call dynamic branch prediction, where essentially we have some device that, given a PC, gives us a very fast prediction of the next PC. Then, once we know what the right next PC actually is, if the predictor was wrong we give it some feedback to retrain it. This is a feedback-driven predictor: when it does something wrong, we update it so that it learns over time about the behavior of the program.

This works because of two things. The first is what we call temporal correlation: it's often the case that the way a branch resolves — taken or not taken — is the same over time. For example, in a loop the branch is going to be taken most of the time. Another example: you have some assertion in your code that's checking whether some condition is met, but it fails very rarely — you have a branch, but that branch is basically never taken, so it's very easy for simple hardware to predict not taken, not taken, not taken. The other kind of information modern processors leverage, which we're not going to get to see, is what we call spatial correlation: in many pieces of code, branches resolve in the same, correlated way, because there's a small number of paths of execution through the code. But that's more advanced, and we won't see it today.

There are many different branch prediction strategies. I'm going to focus on a very simple one that's actually very useful for the design project: the branch target buffer, or BTB for short. The branch target buffer is a cache for branch targets. Here's how it works: at the beginning of fetch, we take some bits of the PC and index into this table, which is basically a cache — we have a tag, we have the target PC (the predicted next PC for this PC), and we have a valid bit, just like a cache. But instead of being a cache for data, this is a cache of predictions: it's telling us, for this PC, what the next PC should be. The way we use it is: if there's a hit, we take the target and use it as the next PC; if there isn't a hit, or we hit on an invalid line, we just use PC + 4. The key idea is that for taken branches and for jumps, we're going to have an entry that tells us what the next PC should be. We train the BTB so that whenever we see a taken branch or a jump, we store the (PC, target PC) pair in this little cache, and the next time we run through it we'll make the right prediction, assuming the branch is still taken.

In a little more detail, here's how it looks. We want to predict the next PC immediately, so we add this BTB: we feed it the PC and we immediately get the prediction for the next PC value. So instead of a loose loop, this is what we call a tight loop — it all happens in the same cycle. Then, when we find out what the actual result is, if there's a misprediction we still need to go update the BTB and redirect the PC, but the key difference is that we're going to take that penalty much less frequently if we're making good predictions.
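A minimal software sketch of the BTB just described — a direct-mapped table of (tag, target) entries, indexed by low PC bits, with a predict path used at fetch and a train-on-taken path. The 64-entry size and full tags are assumptions for illustration, not the design project's required structure.

```python
class BTB:
    """Direct-mapped branch target buffer: a cache of next-PC predictions."""
    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries          # each slot: (tag, target) or None (invalid)

    def _index_and_tag(self, pc):
        word = pc >> 2                         # drop the 2 byte-offset bits
        return word % self.entries, word // self.entries

    def predict(self, pc):
        """Fetch stage: on a hit return the stored target, otherwise PC + 4."""
        idx, tag = self._index_and_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return pc + 4

    def train_taken(self, pc, target):
        """Record a taken branch or jump so the next lookup predicts its target."""
        idx, tag = self._index_and_tag(pc)
        self.table[idx] = (tag, target)
```

For example, after `train_taken(0x1000, 0x0800)`, a later `predict(0x1000)` returns `0x0800` instead of `0x1004`.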
Okay, so some implementation details. Here's the same picture as before — there's also an instruction memory here, so we take the whole PC and use it to index into the instruction memory. One key difference between BTBs and caches is that a BTB is a cache for predictions. If I give you a wrong value from a cache — because I'm not storing the tags, say, or I'm not storing all the bits of the data and I'm making up some bits — then your execution will simply be wrong. But in a BTB, what happens if I give you the wrong value — some target PC that is not, in fact, the right target PC? [Student answer] Yeah: there's some penalty, but it's still correct. So what you have here is an accuracy trade-off rather than a correctness trade-off, which isn't really a trade-off at all. The point is that there are some very easy ways to make a very small and efficient BTB, and I mention this specifically because if you want a good clock cycle time, you might want to make your BTB small by shaving off some bits.

Because it's just a prediction, one thing we can do is not store all the bits of the PC as the tag. Say you have 64 entries, so you use six bits of the PC as the index, and there are two offset bits you're not going to use; that leaves 24 remaining bits, so a full tag would need 24 bits. But just store four or five bits of the tag — take the next four or five bits — and if you have a conflict, well, you'll return the wrong PC. Still fine: there will be some mispredictions, but you're shaving off a whole lot of bits by storing only part of the PC. Yes? [Inaudible student question] Right, exactly: you can have two jumps that fall on the same index and have the same four or five tag bits that you're storing, so you actually have a conflict there — but that's okay; in that case you'll just have a misprediction. [Another question] This actually doesn't assume much about the style of assembly you have. There are more sophisticated predictors that make deeper assumptions, but this is fairly general: you store the target PC; if it's there, you use it; if it's not, you don't. For taken branches and jumps, this tends to work well.

There are other ways of saving state, too: you don't even need to store all the bits of the target PC. Instead of 32 bits you can store, say, twelve bits of the target PC and then fill in the rest of the bits from the current PC. Again, you'll have some mispredictions if you're jumping far away, but most jumps and branches are fairly close by. Same deal: you'll get a bit less prediction accuracy, but much less overhead.
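A sketch of the two bit-shaving tricks just mentioned — partial tags and compressed targets — for a hypothetical 64-entry BTB with 5-bit tags and 12 stored target bits. The exact field widths are illustrative assumptions.

```python
ENTRIES = 64                         # 6 index bits

def index_and_partial_tag(pc):
    idx  = (pc >> 2) & 0x3F          # skip 2 offset bits, take 6 index bits
    tag5 = (pc >> 8) & 0x1F          # keep only the next 5 bits instead of a full 24-bit tag
    return idx, tag5                 # two PCs can alias here; the cost is only a misprediction

def compress_target(target_pc):
    return target_pc & 0xFFF         # store only the low 12 bits of the target

def expand_target(current_pc, stored12):
    # fill in the missing high bits from the current PC; wrong for far-away targets,
    # but that again only costs accuracy, never correctness
    return (current_pc & ~0xFFF) | stored12
```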
Yes? [Student question about shrinking the tags and the stored target PC] So suppose you have cycle-time problems in your fetch stage, because by adding this BTB you now need to index into this table. How large you can make the table affects accuracy, but it also determines how many picoseconds it takes from when you index into it until you know whether you have a match and get the target PC out. The trade-off is that if you reduce the number of bits per entry, you can typically have more entries without increasing the clock cycle time, and that is good for accuracy. Basically, these are implementation tricks that may or may not help you on the design project, depending on how high a frequency you're targeting.

Yes? [Inaudible question] Yes — and that's because at that point you don't even know that it's a branch instruction, so you're accessing the BTB for every PC. This is also why I'm insisting that it should be small, not some huge memory. The other thing is that you're going to have to implement this with registers, because you want a combinational read: on the same cycle, you want to get the data out as quickly as possible, so typically you don't use a memory array for this — you use registers. All right. Another thing you can do is not even store the valid bit, although we'll see in a couple of slides why the valid bit might be good. Just as a hint: even small BTBs can get you a lot of performance improvement.

In a little more detail, this would be the interface for the BTB: you give it the current PC, and there's a predict function that does a lookup to predict the next PC in the fetch stage. Then there's an update method that, on a misprediction, is called from the execute stage: when you have a branch that is taken and you know the right next PC, execute calls update to train the predictor and insert the right (PC, next PC) pair. There's another subtle detail here: sometimes you might be predicting a branch taken — you have a valid PC-to-next-PC translation — but it turns out that the branch is not taken, and then you need to retrain the BTB the other way: you need to delete the entry, mark it invalid, so that next time you do PC + 4, i.e., you predict that the branch is not taken.

This is a good way to improve performance in parts two and three of the design project. In fact, for part two, if you really work hard, you can do all the redirection and next-PC computation in the decode stage — but that's only if you really want to push on reducing clock cycle time; you can do it within the clock-period target that part two requires. If you don't go that far, it's actually really easy to use this BTB in the fetch stage and get good prediction accuracy.
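One possible shape for the predict/update interface just described, with update called from execute once the branch resolves: insert the pair on a taken branch, and invalidate the entry when a predicted-taken branch turns out not taken. The dict-based storage and function names are assumptions, not the design project's required API.

```python
btb = {}   # pc -> predicted next pc; an entry being present plays the role of the valid bit

def predict(pc):
    """Fetch stage: stored target on a hit, PC + 4 otherwise."""
    return btb.get(pc, pc + 4)

def update(pc, taken, target):
    """Execute stage, called on a misprediction: retrain toward the actual outcome."""
    if taken:
        btb[pc] = target        # remember the taken branch/jump and its target
    else:
        btb.pop(pc, None)       # predicted taken but wasn't: invalidate so we go back to PC + 4

# tiny usage example: a branch at 0x100 that is taken twice, then falls through
for actual_taken in (True, True, False):
    guess = predict(0x100)
    actual_next = 0x80 if actual_taken else 0x104
    if guess != actual_next:                      # misprediction detected in execute
        update(0x100, actual_taken, 0x80)
```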
Okay, so this brings us to the last part of the lecture. Our BTB is doing a decent job at storing targets — and this is in fact the main way that modern processors decide where to jump — but there are also better things we can do to predict whether a branch is taken or not. Consider this very simple loop: we have some instructions, then we decrement a counter, and when that counter is zero we fall through; if it's not zero, we jump back. How many mispredictions are we going to have per loop with the BTB?

Let's go through this. Suppose you're in the middle of the loop and you've already taken the branch a bunch of times. The branch is taken, and taken, and taken, and the BTB stores the right next PC for it because we've trained it, so it already has the right prediction. In steady state, we're predicting taken, taken, taken, 100% right. But now we get to the last iteration, and this time the branch is not taken. On the last iteration we mispredict, because the BTB is saying, "no, this branch is taken, you should do one more iteration of the loop," and it turns out that we fall through. So when we exit the loop, we're going to have one misprediction. But what happens when we go into the loop the next time? Yes? [Student answer] Exactly: at that point we've trained the BTB to say, "no, this is not taken" — it stopped being taken — and so we take yet another misprediction when we start the loop.

So what can you do to make this better? One very simple optimization is, instead of having just a valid bit, to have two bits that we treat as a saturating counter, to give some hysteresis to the process. Suppose we have two bits per BTB entry, a counter going from zero to three: we increase the counter when the branch is taken and decrease it when it's not taken. When the counter is 0 or 1, we predict not taken; when it's 2 or 3, we predict taken. So what happens in our loop now? We're predicting taken, taken, taken on each iteration, so the branch is biased to strongly taken. We hit the last iteration of the loop, it's not taken, and we go down to weakly taken — but we still predict taken, because it's been taken so many times before, and that's the right thing to do: it's more likely for this branch to be taken in the future. So when we hit the first iteration of the loop again, we see that it is taken and go back to strongly taken. So how many mispredictions per loop now? We've gone from two down to one.
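A minimal sketch of the 2-bit saturating counter (0–3, predict taken when the counter is 2 or more) on the loop branch just discussed — taken on every iteration except the last. The 10-iteration loop length is an arbitrary assumption.

```python
def run_loop(counter, iterations=10):
    """Simulate one execution of the loop branch: taken (iterations-1) times, then not taken."""
    mispredicts = 0
    outcomes = [True] * (iterations - 1) + [False]
    for taken in outcomes:
        if (counter >= 2) != taken:                          # predict taken when counter is 2 or 3
            mispredicts += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return counter, mispredicts

counter = 3                         # already trained "strongly taken" by earlier runs
for _ in range(2):                  # run the whole loop twice in a row
    counter, m = run_loop(counter)
    print(m)                        # 1 misprediction per loop execution (only the loop exit)
```

With only a valid bit, the same trace would mispredict twice per loop execution: once at the exit and once again on re-entry.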
At a very high level, modern processors actually have many of these more specialized predictors. They all have a BTB, but then they have specialized predictors for branch direction and for return-address prediction — for example, when you have function calls, the processor remembers the PC you're expected to return to, and on a return instruction it feeds that PC in as the next PC — plus loop predictors and so on. There are lots of predictors specialized to particular program behaviors, and along the pipeline, as soon as the processor gets more information, if it detects — or believes — that it has gone down the wrong path, it tries to redirect control flow to reduce the misprediction penalty.

Okay, so that's it for today. Modern processors basically rely on a handful of techniques: deeper and wider pipelines to get good ideal CPI, and then out-of-order execution and branch prediction to reduce the impact of hazards. These techniques have been very good at delivering high sequential performance — you write a serial program and it runs really quickly — but they unfortunately ran out of steam in the mid-2000s, and since then we've moved to multicore processors that integrate multiple of these processing cores. That's what we will discuss next week. The next lecture will still stay on single-core, high-performance processors, and we'll see how to do bypassing in Bluespec and other tips for the design project. All right, thank you.