The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

All right, so we've talked a little bit about caching before, but today we're going to talk in much more detail about caching and how to design cache-efficient algorithms. First, let's look at the caching hardware on modern machines. Here's what the cache hierarchy looks like for a multi-core chip: we have a whole bunch of processors, and they all have their own private L1 caches, for data as well as for instructions. They also each have a private L2 cache, and then they share a last-level cache, or L3 cache, also called the LLC. They're all connected to a memory controller that can access DRAM, and oftentimes you'll have multiple chips on the same server, connected through a network. So here we have a bunch of multi-core chips connected together.

We can see there are different levels of memory here, and the sizes of the levels differ — the sizes tend to go up as you move up the memory hierarchy. The L1 caches tend to be about 32 kilobytes; in fact, these are the specifications for the machines you're using in this class: 32 kilobytes for both the L1 data cache and the L1 instruction cache, and 256 kilobytes for the L2 cache, so the L2 cache tends to be about 8 to 10 times larger than the L1 cache. The last-level cache here is 30 megabytes — typically it's on the order of tens of megabytes — and then DRAM is on the order of gigabytes; we have 128 gigabytes of DRAM, and nowadays you can get machines with terabytes of DRAM. The associativity also tends to go up as you move up the cache hierarchy, and I'll talk more about associativity in the next couple of slides. The time to access the memory also tends to go up, so the latency increases as you move up the memory hierarchy. The L1 caches are the quickest to access, about 2 nanoseconds — these are just rough numbers. The L2 cache is a little slower, say 4 nanoseconds; the last-level cache maybe 6 nanoseconds; and then when you have to go to DRAM it's about an order of magnitude slower, 50 nanoseconds in this example. The reason the memories lower down in the cache hierarchy are faster is that they're manufactured with more expensive materials, but since they're more expensive, we can't fit as much of them on the machine — that's why the faster memories are smaller than the slower memories. But if we're able to take advantage of locality in our programs, we can make use of the fast memory as much as possible, and we'll talk about ways to do that in this lecture. There's also the latency across the network, which tends to be cheaper than going to main memory but slower than a last-level cache access.

And there's a lot of work in trying to get the cache coherence protocols right, as we mentioned before: since these processors all have private caches, we need to make sure they all see a consistent view of memory when they access the same memory addresses in parallel. We talked about the MSI cache protocol before, and there are many other protocols out there — you can read more about them online — but these are very hard to get right, and there's a lot of verification involved in trying to prove that the cache coherence protocols are correct.
Any questions so far? OK, so let's talk about the associativity of a cache. Here I'm showing you a fully associative cache: in a fully associative cache, a cache block can reside anywhere in the cache. The basic unit of movement here is a cache block; in this example the cache block size is 4 bytes, but on the machines we're using for this class the cache block size is 64 bytes. For this example I'm going to use a 4-byte cache line, and each row here corresponds to one cache line. Fully associative means each line can go anywhere in the cache. We're also assuming a cache size of 32 bytes, so in total it can store 8 cache lines, since each cache line is 4 bytes. To find a block in a fully associative cache, you actually have to search the entire cache, because a cache line can appear anywhere in it. There's a tag associated with each cache line that specifies which memory address in the virtual memory space it corresponds to. For the fully associative cache we use most of the bits of the address as the tag: we don't need the two lowest-order bits, because things are moved at the granularity of cache lines, which are 4 bytes here, so the two lowest-order bits are always the same; we just use the rest of the bits to store the tag. If our address space is 64 bits, then we use 62 bits to store the tag in a fully associative scheme. When the cache becomes full, a block has to be evicted to make room for a new block, and there are various ways you can decide which block to evict — this is known as the replacement policy. One common replacement policy is LRU, or least recently used: you basically kick out the block that was used furthest in the past. There are other schemes, such as second chance and clock replacement; we're not going to talk much about the different replacement schemes today, but feel free to read about them online. So what's a disadvantage of this scheme? Yes — it's slow. Why is it slow? Right: the disadvantage is that searching for a cache line in the cache can be pretty slow, because in the worst case you have to search the entire cache, since a cache block can reside anywhere. Even though the search can go on in parallel in hardware, it's still expensive in terms of power and performance to search most of the cache every time.

So let's look at the other extreme: a direct-mapped cache. In a direct-mapped cache, each cache block can go in only one place in the cache. I've color-coded the cache blocks here: the red blocks can only go in the first row of this cache, the orange ones only in the second row, and so on. The position a cache block can go into is known as that cache block's set — the set determines the location in the cache for each particular block. Let's look at how the virtual memory address is divided up, and which bits we use to figure out where a cache block should go in the cache. We have the offset, the set, and the tag fields. The offset tells us which position we want to access within a cache block: since a cache block has B bytes, we only need log base 2 of B bits for the offset. The reason we have an offset at all is that we're not always accessing something at the beginning of a cache block — we might want to access something in the middle — and that's why we need the offset, to specify where in the cache block we want to access.
Then there's the set field, and the set field determines which position in the cache the cache block can go into. There are eight possible positions for each cache block here, so we only need log base 2 of 8 bits — three bits — for the set in this example; more generally it's log base 2 of M over B, and here M over B is eight. Finally, we use the remaining bits as the tag, w minus log base 2 of M bits, and the tag gets stored along with the cache block in the cache; it uniquely identifies which memory block in virtual memory the cache block corresponds to. You can verify that the sum of all these quantities is w bits: in total we have a w-bit address space, and the sum of the three fields is w. So what's the advantage and the disadvantage of this scheme? First, what's a good thing about this scheme compared to the previous one we saw? Yes — it's fast, because you only have to check one place: each cache block can go in only one place in the cache, and that's the only place you have to check when you do a lookup. If the cache block is there, you find it; if it's not, you know it's not in the cache. What's the downside? Yeah, good answer: the downside is that you might, for example, just be accessing the red cache blocks and not accessing any of the other cache blocks. They all get mapped to the same location in the cache, and they keep evicting each other even though there's a lot of empty space in the cache. This is known as a conflict miss, and these can be very bad for performance and very hard to debug. So that's one downside of a direct-mapped cache: you can get conflict misses, where you have to evict things from the cache even though there's empty space in the cache. As we said, finding a block is very fast — only a single location in the cache has to be searched — but you might suffer conflict misses if you keep accessing things in the same set repeatedly without accessing the things in the other sets. Any questions?

OK, so those are the two extremes of cache design. There's actually a hybrid solution called a set-associative cache. In a set-associative cache you still have sets, but each set contains more than one line now. All the red blocks still map to the red set, but there are two possible locations for a red block now, so this is known as a 2-way associative cache, since there are two possible locations within each set. In general, a cache block's set determines k possible cache locations for that block: within a set it's fully associative, but each block can go in only one set. Let's look again at how the bits of the address are divided. We still have the tag, set, and offset fields. The offset field is still log base 2 of B bits. The set field takes log base 2 of M over kB bits: the number of sets is M over kB, so we need log base 2 of that number to represent a block's set. Finally, we use the remaining bits as the tag, so w minus log base 2 of M over k bits. Now, to find a block in the cache, only the k locations of its set must be searched: you basically find which set the cache block maps to, and then you check all k locations within that set to see whether the cache block is there.
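To make the bit arithmetic concrete, here's a minimal C sketch of how an address decomposes into offset, set, and tag for a set-associative cache. The parameter values are illustrative assumptions (a 32 KB, 8-way cache with 64-byte lines), not a description of any particular machine:

```c
// Sketch: splitting a 64-bit address into offset, set, and tag fields.
// B, M, and K are assumed example parameters, not fixed hardware facts.
#include <stdint.h>
#include <stdio.h>

#define B 64                /* cache line size in bytes           */
#define M (32 * 1024)       /* total cache size in bytes          */
#define K 8                 /* associativity (lines per set)      */
#define NSETS (M / (K * B)) /* number of sets = M / (k*B)         */

int main(void) {
    uint64_t addr = 0x7ffe12345678;                  /* arbitrary example  */
    uint64_t offset = addr % B;                      /* low log2(B) bits   */
    uint64_t set    = (addr / B) % NSETS;            /* next log2(M/kB) bits */
    uint64_t tag    = addr / ((uint64_t)B * NSETS);  /* remaining high bits  */
    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```

With these numbers, the offset is the low 6 bits, the set is the next 6 bits, and the remaining 52 bits form the tag.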
And whenever you try to put something into the cache because it's not there, you have to evict something, and you evict it from the same set as the block you're placing into the cache. For this example I showed a 2-way associative cache, but in practice the associativity is usually bigger — say 8-way, 16-way, or sometimes 20-way. As you keep increasing the associativity, it looks more and more like a fully associative cache, and if you have a 1-way associative cache, it's just a direct-mapped cache. So this is a hybrid in between a direct-mapped cache and a fully associative cache. Any questions on set-associative caches?

OK, let's go over a taxonomy of the different types of cache misses you can incur. The first type of cache miss is a cold miss, which is a cache miss you have to incur the first time you access a cache block. If you need to access that piece of data, there's no way to avoid a cold miss, because your cache starts out not having the block, and the first time you access it you have to bring it into the cache. Then there are capacity misses: cache misses you get because the cache is full and can't fit all the cache blocks you want to access. You get a capacity miss when the previous cached copy would have been evicted even with a fully associative scheme — even if all the possible locations in your cache could be used for that cache line, the line still has to be evicted because there isn't enough space. You can deal with capacity misses by introducing more locality into your code, both spatial and temporal, and we'll look at ways to reduce the capacity misses of algorithms later in this lecture. Then there are conflict misses. Conflict misses happen in set-associative caches when you have too many blocks from the same set wanting to go into the cache, and some of them have to be evicted because the set can't fit all of the blocks — and these blocks would not have been evicted had you had a fully associative scheme. Those are what's called conflict misses: for example, if you have 16 slots in a set and you keep accessing 17 things that all belong in that set, something gets kicked out every time you access something, and those evictions might not have happened with a fully associative cache. Finally, there are sharing misses, which only happen in a parallel context. We talked a little about true sharing and false sharing misses in prior lectures, so let's review this briefly. Sharing misses can happen if multiple processors are accessing the same cache line and at least one of them is writing to it. If all the processors are just reading from the cache line, the cache coherence protocol knows how to make it work so you don't get cache misses — they can all access the same cache line at the same time if nobody modifies it. But if at least one processor is modifying it, you can get either true sharing misses or false sharing misses. A true sharing miss is when two processors are accessing the same data on the same cache line: as you recall from a previous lecture, if one of the two processors writes to the cache line, then whenever it does a write it needs to acquire the cache line in exclusive mode and invalidate that cache line in all the other caches.
Then, when another processor tries to access the same memory location, it has to bring the line back into its own cache, and you get a cache miss there. A false sharing miss happens if two processors are accessing different data that just happen to reside on the same cache line — because the basic unit of movement in the architecture is a cache line, even if you're accessing different things, if they're on the same cache line you still get a sharing miss. False sharing is pretty hard to deal with, because in general you don't know what data gets placed on what cache line. There are certain heuristics you can use: for example, if you're malloc'ing a big memory region, you know that memory region is contiguous, so you can space the accesses by different processors far enough apart that they don't touch the same cache line. But if you're just declaring local variables on the stack, you don't know where the compiler will decide to place those variables in the virtual memory address space. So those are four types of cache misses you should know about. There are many models out there for analyzing the cache performance of algorithms, and some of the models ignore some of these types of misses — just be aware of this when you're looking at an algorithm analysis, because not all models capture all of these different types of cache misses.

So let's look at a bad case for conflict misses. Here I want to access a submatrix within a larger matrix, and recall that matrices are stored in row-major order. Let's say our matrix is 4096 columns by 4096 rows, and it stores doubles. Each row therefore contains 2 to the 15th bytes: 4096 is 2 to the 12th, and each double takes 8 bytes, so 2 to the 12th times 2 to the 3rd is 2 to the 15th. We'll assume the address width w is 64, which is standard; a cache size of 32K; a cache block size of 64, which again is standard; and a 4-way associative cache. So let's look at how the bits are divided up. Again we have the offset, which takes log base 2 of B bits — how many bits do we have for the offset in this example? Right, six bits: just log base 2 of 64. What about the set — how many bits do we have for that? Seven — who said seven? Yes, it is seven: M is 32K, which is 2 to the 15th, k is 2 squared, and B is 2 to the 6th, so M over kB is 2 to the 15th divided by 2 to the 8th, which is 2 to the 7th, and log base 2 of that is 7. And finally, what about the tag field? 51 — why is that? Yes, it's just 64 minus 7 minus 6, which is 51.
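Written out, the field widths for this example are:

```latex
% Field widths for a 32 KB, 4-way cache with 64-byte lines and w = 64.
\begin{align*}
\text{offset} &= \log_2 B = \log_2 64 = 6 \text{ bits},\\
\text{set}    &= \log_2 \tfrac{M}{kB}
               = \log_2 \tfrac{2^{15}}{2^{2}\cdot 2^{6}}
               = \log_2 2^{7} = 7 \text{ bits},\\
\text{tag}    &= w - 7 - 6 = 64 - 13 = 51 \text{ bits}.
\end{align*}
```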
Now let's say we want to access a 32-by-32 submatrix A within this larger matrix — this is pretty common in matrix algorithms, especially divide-and-conquer algorithms — and let's say we want to access a column of A. The addresses of the elements we're going to access are as follows. The first element in the column is stored at some address x. The second element in the column is stored at x plus 2 to the 15th, because each row has 2 to the 15th bytes and we're skipping over an entire row to get to the element in the next row of the submatrix. To get the third element we add 2 times 2 to the 15th, and so on, until the last element, which is at x plus 31 times 2 to the 15th. So which fields of the address are changing as we go through one column of this submatrix? Yeah — it's just the tag that changes. The set and the offset stay the same, because they live in the lower 13 bits, and when we increment by 2 to the 15th we don't touch the set or the offset. So all of these addresses fall into the same set, and that's a problem, because our cache is only 4-way associative: we can only fit four cache lines in each set, and here we're accessing 32 of them. By the time we get to the next column of A, everything we accessed in the current column of A has already been evicted from the cache. This is known as a conflict miss, because if you had a fully associative cache this might not have happened — you could use any location in the cache to store these cache blocks. Does anybody have questions on why we get conflict misses here? Does anybody have ideas on how to fix this — what can I do so that I'm not incrementing by exactly 2 to the 15th every time? Yeah — one solution is to pad the matrix. You add some constant amount of space to the end of each row, so each row is longer than 2 to the 15th bytes; maybe you add some small constant like 17 bytes to the end of each row. Now when you access a column of this submatrix, you're not just incrementing by 2 to the 15th, you're also adding a small integer, and that causes the set and offset fields to change as well, so you won't get as many conflict misses. That's one way to solve the problem. It also turns out that if you're doing a matrix multiplication algorithm — a cubic-work algorithm — you can afford to copy the submatrix into a temporary 32-by-32 matrix, do all the operations on the temporary matrix, and then copy it back out to the original matrix. The copying takes only quadratic work across the whole algorithm, and since the whole algorithm takes cubic work, the quadratic work is a lower-order term. So you can also use temporary space to make sure you don't get conflict misses. Any questions? OK, so that was conflict misses. Conflict misses are important, but usually we're first concerned with getting good spatial and temporal locality, because those are usually the higher-order factors in the performance of a program. Once we get good spatial and temporal locality in our program, we can start worrying about conflict misses — for example, by using temporary space, or by padding our data by some small constant so we don't incur as many conflict misses.
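Here's a minimal sketch of the padding fix in C. The lecture suggests padding each row by a small constant like 17; in this sketch I pad by 17 doubles rather than 17 bytes just to keep the array type simple — the exact constant is illustrative, not tuned:

```c
// Padded matrix: each row is N + PAD doubles instead of N, so the row
// stride is no longer exactly 2^15 bytes. Walking down a column now
// changes the set bits from row to row, spreading the accesses across
// many sets instead of piling them all into one.
#define N   4096
#define PAD 17            /* small illustrative constant, not tuned */

static double A[N][N + PAD];

double column_sum(int j) {    /* touch one column of the matrix */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += A[i][j];         /* stride = (N + PAD) * 8 bytes */
    return s;
}
```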
OK, so now I want to talk about a model we can use to analyze the cache performance of algorithms, called the ideal-cache model. In this model we have a two-level memory hierarchy: a cache and main memory. The cache has size M and cache lines of B bytes, so we can fit M over B cache lines in the cache. This model assumes the cache is fully associative — any cache block can go anywhere in the cache — and it also assumes an optimal, omniscient replacement policy. This means that when we want to evict a cache block from the cache, we pick the victim that gives us the best performance overall — the lowest number of cache misses throughout the entire algorithm — so we're assuming we know the whole sequence of memory requests in advance; that's why it's called the omniscient replacement policy. If something is in cache, you can operate on it for free; if it's in main memory, you have to bring it into the cache, and you incur a cache miss. There are two performance measures we care about. First, the work, which is just the ordinary running time of the program — the same as before when we were analyzing algorithms: the total number of operations the program does. Second, the number of cache misses, which is the number of lines we have to transfer between main memory and the cache. So the number of cache misses just counts cache transfers, whereas the work counts all the operations in the algorithm. Ideally we'd like to come up with algorithms that have a low number of cache misses without increasing the work over the traditional standard algorithm. Sometimes we can do that and sometimes we can't, and then there's a trade-off between the work and the number of cache misses — a trade-off you have to weigh as a performance engineer. Today we're going to look at an algorithm where you actually can reduce the number of cache misses without increasing the work, so you get the best of both worlds. Any questions on this ideal-cache model? This model is just used for analyzing algorithms — you can't actually buy one of these caches at the store; it's a very idealized cache, and they don't exist. But it turns out the optimal omniscient replacement policy has nice theoretical properties. There's a very important lemma, proved by Sleator and Tarjan in 1985, called the LRU lemma. The lemma says: suppose an algorithm incurs Q cache misses on an ideal cache of size M; then on a fully associative cache of size 2M that uses the LRU (least recently used) replacement policy, it incurs at most 2Q cache misses. So if I can bound the number of cache misses of an algorithm on the ideal cache, then on a fully associative cache twice the size running LRU — which is a pretty practical policy — the algorithm incurs at most twice the number of cache misses. The implication of this lemma is that for asymptotic analysis you can assume either the optimal replacement policy or the LRU replacement policy, as convenient, because the miss counts are within a constant factor of each other. This is a very important lemma — it makes it much easier for us to analyze the cache misses of algorithms. And here's a software engineering principle I want to point out. When you're trying to get good performance, first come up with a theoretically good algorithm that has good bounds on the work and the cache complexity; only after that do you start engineering for detailed performance — worrying about details such as real-world caches not being fully associative, or loads and stores having different costs with respect to bandwidth and latency. Coming up with a theoretically good algorithm is the first-order bit of getting good performance. Questions? OK, so let's start analyzing the number of cache misses in programs.
Here's a lemma — the cache-miss lemma. Suppose a program reads a set of R data segments, where the i-th segment consists of s_i bytes, and suppose the sum of the sizes of all the segments is N. We're going to assume N is less than M over 3 — the total size of the segments is less than a third of the cache size — and we'll also assume N over R is greater than or equal to B. Recall that R is the number of data segments and N is their total size, so what does N over R mean semantically? Yes — N over R is just the average size of a segment, and here we're saying the average segment is at least B bytes, at least the size of a cache line. If these two assumptions hold, then all the segments fit into cache, and the number of cache misses to read them all is at most 3N over B. For comparison, if you had just a single array of size N, the number of cache misses needed to read that array into cache would be about N over B; this lemma says that even if our data is divided into a bunch of segments, as long as the average length of the segments is large enough, the number of cache misses is just a constant factor worse than reading a single array. Let's try to prove the cache-miss lemma. A single segment s_i incurs at most s_i over B plus 2 cache misses — does anyone want to tell me where the s_i over B plus 2 comes from? Say this is the segment we're analyzing, and this is how it's aligned in virtual memory. Yes — s_i over B plus 2 is the number of blocks the segment could overlap in the worst case: you need s_i over B misses just to load the s_i bytes, but the beginning and the end of the segment might not be perfectly aligned with cache line boundaries, so you can waste at most one block on each side of the segment; that's where the plus 2 comes from. To get the total number of cache misses, we just sum this quantity from i equals 1 to R. Summing s_i over B from i equals 1 to R gives N over B, by definition, and summing 2 from i equals 1 to R gives 2R. Now multiply the top and bottom of that second term by B, giving 2RB over B, and that's less than or equal to 2N over B. Where does that inequality come from — why is 2RB at most 2N? Yes — we know N is greater than or equal to BR by the assumption up there, so RB is at most N. Then N over B plus 2N over B sums to 3N over B, so in the worst case we incur 3N over B cache misses.
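Here's the whole chain of bounds from the proof, written out; N is the total size of the segments, and the middle inequality uses the assumption N ≥ BR:

```latex
\[
  \text{misses} \;\le\; \sum_{i=1}^{R}\Big(\frac{s_i}{B} + 2\Big)
  \;=\; \frac{N}{B} + 2R
  \;=\; \frac{N}{B} + \frac{2RB}{B}
  \;\le\; \frac{N}{B} + \frac{2N}{B}
  \;=\; \frac{3N}{B}.
\]
```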
Any questions on this cache-miss lemma? The important thing to remember here is that if you have a whole bunch of data segments, and the average length of your segments is large enough — bigger than a cache block — then you can access all of these segments just as if they were a single array; it only increases the number of cache misses by a constant factor, and if you're doing asymptotic analysis it doesn't matter. We're going to use this cache-miss lemma later on when we analyze algorithms. OK, another assumption we're going to need is called the tall-cache assumption. The tall-cache assumption basically says the cache is taller than it is wide: B squared is less than cM for some sufficiently small constant c less than or equal to 1. In other words, it says the number of cache lines you have, M over B, is bigger than B. This tall-cache assumption is usually satisfied in practice. Here are the cache line sizes and cache sizes on the machines we're using: the cache line size is 64 bytes, and the L1 cache size is 32 kilobytes. 64 bytes squared is 2 to the 12th, and 32 kilobytes is 2 to the 15th bytes; 2 to the 12th is less than 2 to the 15th, so it satisfies the tall-cache assumption. And as we go up the memory hierarchy the cache size increases while the cache line length stays the same, so the caches become even taller as we move up the hierarchy. To see why the tall-cache assumption is useful, let's look at what's wrong with a short cache. In a short cache, the lines are very wide — wider than the number of lines we can have in the cache. Say we're working with an n-by-n submatrix stored in row-major order. If you have a short cache, then even if n squared is less than cM — meaning all the bytes of the submatrix fit in the cache — you might still not be able to fit it into a short cache, and this picture illustrates why: we have n rows here, but we can only fit M over B of the rows in the cache, because the cache lines are so long, and we're actually wasting a lot of space on each cache line — we're using only a small fraction of each cache line to store a row of this submatrix. If this were the entire matrix it would actually be OK, because consecutive rows are placed together consecutively in memory; but if this is a submatrix, we have no guarantee that the next row is placed right after the current row, and we often have to deal with submatrices when we're doing recursive matrix algorithms. So that's what's wrong with short caches, and that's why we want to assume the tall-cache assumption — and we can, because it's usually satisfied in practice. The TLB actually tends to be short — it only has a couple of entries — so it might not satisfy the tall-cache assumption, but all the other caches will. Any questions?

OK, here's another lemma that will be useful, called the submatrix caching lemma. Suppose we have an n-by-n matrix A, and it's read into a tall cache satisfying B squared less than cM for some constant c less than or equal to 1, and suppose n squared is less than M over 3 but greater than or equal to cM. Then A fits into cache, and the number of cache misses required to read all of A's elements into cache is at most 3 n squared over B. Let's see why this is true. We'll let big N denote the total number of bytes we need to access, so big N equals n squared, and we'll use the cache-miss lemma, which says that if the average length of our segments is large enough, we can read all the segments in just as if they were a single contiguous array. The segments here are the rows: R equals little n is the number of segments, and each segment's length is also little n, since we're working with a square submatrix. We also have that the cache block size B is less than or equal to n, which equals big N over R. Where do we get this property that B is less than or equal to n? I made some assumptions up here that I can use to infer that B is less than or equal to n — does anybody see where?
Yeah — I know B squared is less than cM, and cM is less than or equal to n squared, so B squared is less than n squared, and B is less than n. I also have that n squared is less than M over 3, just by assumption, so I can use the cache-miss lemma, and the cache-miss lemma tells me I only need a total of 3 n squared over B cache misses to read this whole thing in. Any questions on the submatrix caching lemma?

OK, now let's analyze matrix multiplication. How many of you have seen matrix multiplication before? A couple of you — OK. Here's what the code looks like for the standard cubic-work matrix multiplication algorithm. We have two input matrices, A and B, and we're going to store the result in C; the height and width of our matrices is n. We're just dealing with square matrices here, but what I'm going to talk about also extends to non-square matrices. We have three loops: i from 0 to n minus 1, j from 0 to n minus 1, and k from 0 to n minus 1, and we increment C[i·n + j] by A[i·n + k] times B[k·n + j]. This is just the standard code for matrix multiply. So what's the work of this algorithm? This should be review for all of you — n cubed.
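For reference, here is that loop nest written out in C, exactly as described — row-major n-by-n matrices with the i, j, k loop order:

```c
// Standard cubic-work matrix multiply: C += A * B, with all three
// matrices n x n and stored in row-major order.
void matmul(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```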
OK, now let's analyze the number of cache misses this algorithm incurs. Again, we assume the matrices are in row-major order and that we satisfy the tall-cache assumption. We're going to analyze the number of cache misses on matrix B, because the misses incurred by B turn out to dominate the overall miss count. There are three cases to consider. The first case is when n is greater than cM over B for some constant c. We analyze matrix B assuming LRU — recall that the LRU lemma says whatever we analyze using LRU is within a constant factor of what we'd analyze using the ideal cache. To do this matrix multiplication, I go through one row of A and one column of B and take the dot product; that's what happens in the innermost loop. How many cache misses do I incur when I go down one column of B? In this case n is greater than cM over B, so I can't fit one block from each row of B into the cache. How many rows of B are there? n. And how many cache misses per row? One. So in total I need n cache misses for the first column of B. What about the second column? Recall that I'm assuming the LRU replacement policy here: when the cache is full, I evict the block that was used furthest in the past. It's still going to be n — why? Because I can't fit one cache block from each row in my cache, so by the time I get back to the top of matrix B, the top block has already been evicted from the cache and I have to load it back in — and the same is true for every other block I access. So I again need n misses for the second column of B, and the same for all the columns of B, and I have to do all of this again for the second row of A. In total, I need Θ(n³) cache misses — one cache miss per entry of B that I access. That's not very good, because the total work was also Θ(n³): I'm gaining nothing from locality in this algorithm. Any questions on this analysis? That was case one; let's look at case two. In this case n is less than cM over B, so I can fit one block from each row of B into cache, but n is greater than another constant c′ times the square root of M, so I can't fit the whole matrix into cache. Again, analyze the misses on B assuming LRU. How many misses do I incur for the first column of B? The same as before — n. What about the second column? By the time I get back to the beginning of the matrix, is the top block still in cache? Who thinks the block is still in cache when I get back to the beginning? A couple of people. Who thinks it's out of cache? OK — it turns out it is still in cache, because I can fit one block for every row of B, since n is less than cM over B. So when I get to the beginning of the second column, that block is still in cache — I loaded it in while accessing the first column — and I incur no cache misses for the second column. In general, if I can fit some constant times B columns' worth of blocks into cache, I reduce my number of cache misses by a factor of B: I incur a miss only on the first access to each block, not on the subsequent accesses. The same is true for each row of A, and since A has n rows and I incur Θ(n²/B) misses per row of A, the overall number of cache misses is Θ(n³/B). Inside matrix B I'm exploiting spatial locality: once I load a block, I can reuse it the next time I traverse a nearby column. Any questions on this analysis? OK, the third case: here n is less than c′ times the square root of M, which means the entire matrix fits into cache. Analyzing the misses on matrix B under LRU again: how many misses do I have now — let's count the total misses for each pass through a row of A. For the first row of A, the analysis is the same as before: I need Θ(n²/B) misses to load B into the cache. What about the second row of A — how many misses do I incur? For the second row of A I incur no misses at all, because once I load B into cache it stays in cache — the entire matrix fits, since n is less than c′ times the square root of M. So the total number of cache misses for matrix B is Θ(n²/B), which is just the submatrix caching lemma from before. Overall, though, this is not a very good algorithm, because as you recall, in case one I needed a cubic number of cache misses. What happens if I swap the order of the inner two loops? Recall this was one of the optimizations in Lecture 1, when Charles talked about matrix multiplication and how to speed it up. If I swap the order of the two inner loops, then in every innermost iteration I'm actually going over a row of C and a row of B, while the entry of A stays fixed. Now when I analyze the misses on matrix B assuming LRU, I benefit from spatial locality, since I'm going row by row and the matrix is stored in row-major order: across all the rows I require only Θ(n²/B) misses, and I do this n times for the outer loop, so in total I get Θ(n³/B) cache misses. Swapping the order of the inner two loops significantly improves the locality of the algorithm, and that's why we saw a significant performance improvement in the first lecture when we swapped the order of the loops. Any questions?
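Here's the swapped version — the only change from the code above is the order of the j and k loops:

```c
// Same computation with the inner loops swapped to i, k, j order.
// The innermost loop now scans a row of C and a row of B, so both
// get spatial locality in row-major storage; A[i*n + k] stays fixed.
void matmul_swapped(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```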
Does anybody think we can do better than n cubed over B cache misses, or do you think that's the best you can do? How many people think you can do better? How many think this is the best you can do? And how many people don't care? It turns out you can do better, and we're going to do better using an optimization called tiling. Here's how it works: instead of just three for loops, I'll have six for loops, and I'll loop over tiles — I'll loop over s-by-s submatrices, and within each submatrix I'll do all the computation I need for that submatrix before moving on to the next one. The three innermost loops loop inside a submatrix, and the three outermost loops step through the larger matrix, one submatrix at a time. Let's analyze the work of this algorithm. The work for one s-by-s submatrix multiplication is Θ(s³), since that's the bound for matrix multiplication, and the number of times I operate on submatrices is (n/s)³ — you can see this by considering each submatrix to be a single element and applying the same cubic-work analysis to the smaller matrix. So the work is (n/s)³ times s³, which is Θ(n³): the work of this tiled matrix multiply is the same as the version without tiling. Now let's analyze the number of cache misses. We tune s so that the submatrices just fit into cache, so we set s equal to Θ(√M) — we actually need to make this something like one-third of √M, because we need to fit three submatrices in the cache, but it's some constant times √M. The submatrix caching lemma implies that for each submatrix we need Θ(s²/B) misses to load it in, and once it's loaded it fits entirely in cache, so we can do all the computation within cache without incurring any more misses. Therefore the total number of cache misses is the number of subproblems, (n/s)³, times the misses per subproblem, Θ(s²/B), and if you multiply that out with s = Θ(√M) you get Θ(n³ / (B√M)). This is a pretty cool result, because it says you can actually do better than the n³/B bound — you can improve it by a factor of √M — and in practice √M is not insignificant: for example, the last-level cache is on the order of megabytes, so √M is on the order of thousands. So this significantly improves the performance of the matrix multiplication code, if you tune s so that the submatrices just fit in the cache.
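Here's a sketch of the six-loop tiled multiply. S is the voodoo tuning parameter the lecture describes — it should be on the order of √(M/3) so three tiles of doubles fit in cache; S = 64 below is purely illustrative, and for brevity the sketch assumes n is a multiple of S:

```c
// Tiled matrix multiply: the three outer loops walk over S x S tiles,
// and the three inner loops do all the work for one tile while it is
// resident in cache.
#define S 64   /* voodoo parameter: tune so three S x S tiles of */
               /* doubles fit in the target cache                */

void matmul_tiled(const double *A, const double *B, double *C, int n) {
    for (int ih = 0; ih < n; ih += S)
        for (int jh = 0; jh < n; jh += S)
            for (int kh = 0; kh < n; kh += S)
                for (int i = ih; i < ih + S; i++)
                    for (int j = jh; j < jh + S; j++)
                        for (int k = kh; k < kh + S; k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```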
It turns out this bound is optimal — that was shown in 1981 — so for cubic-work matrix multiplication this is the best you can do; if you use another matrix multiplication algorithm, like Strassen's algorithm, you can do better. I want you to remember this bound — it's a very important bound to know. It says that for matrix multiplication you can benefit both from spatial locality and from temporal locality: the B term in the denominator comes from spatial locality, and the √M term comes from temporal locality, since I do all the work inside the submatrix before evicting it from cache. Any questions on this analysis? So, what's one issue with this algorithm? Yes — the problem is that I've tuned s for my particular machine. I call s a voodoo parameter: it's sort of a magic number I put into my program so the tiles fit the cache of the particular machine I'm running on, and this makes the code not portable — if I try to run it on another machine, the cache sizes might be different there, and I won't get the same performance I did on my machine. This is also an issue even if you're running on the same machine, because you might have other programs running at the same time using part of the cache, so you don't actually know how much of the cache your program gets to use in a multiprogrammed environment. And this was just for one level of cache. If we want to optimize for two levels of cache, we need two voodoo parameters, s and t — we'll have submatrices and sub-submatrices — and we have to tune both parameters to get the best performance on our machine. Multidimensional tuning optimization can't be done simply with binary search: if you're just tuning for one level of cache, you can binary-search on the parameter s, but here you can't, so it's much more expensive to optimize, and the code becomes messier — nine nested for loops instead of six. And how many levels of cache do the machines we're using today have? Three. For a three-level cache you have three voodoo parameters and twelve nested for loops; the code becomes very ugly, you have to tune the parameters for your particular machine, and that makes the code not very portable, as one student pointed out. In a multiprogrammed environment you don't actually know the effective cache size your program has access to, because other jobs are running at the same time, so it's very easy to mistune the parameters. Questions? Yes — you can auto-tune your program so it's optimized for the cache sizes of your particular machine. Is there an instruction to get the size of your cache? I'm not actually sure — do you know? /proc/cpuinfo? Yeah, so you can probably get it from there as well. But even if you do that, when you run this program while other jobs are running, you still don't know how much cache your program actually has access to. Yes — no, the techniques are actually general-purpose; today we're just looking at matrix multiply, but in Thursday's lecture we'll look at many other problems and how to optimize them for the cache hierarchy. Other questions?
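As a side note on querying cache sizes programmatically: on Linux with glibc you can also ask sysconf, in addition to reading /proc/cpuinfo or the files under /sys/devices/system/cpu/cpu0/cache/. This is a glibc-specific sketch — the _SC_LEVEL* names are a glibc extension, and the calls may return 0 where the value is unknown:

```c
// Query cache sizes at run time (glibc extension; values may be 0).
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d = %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2  = %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3  = %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```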
OK, so that was a good algorithm in terms of cache performance, but it wasn't very portable. Let's see if we can do better — a simpler design where we still get pretty good cache performance. We're going to turn to divide and conquer and look at the recursive matrix multiplication algorithm we saw before. Again we'll deal with square matrices, but the results generalize to non-square matrices. How this works: we split each input matrix into four submatrices, or quadrants, and then each quadrant of the output matrix is the sum of two matrix multiplies on (n/2)-by-(n/2) matrices. So C11 is A11 times B11 plus A12 times B21, and we do this recursively — at every level of recursion we get eight multiply-adds of (n/2)-by-(n/2) matrices. Here's what the recursive code looks like; you can see we have eight recursive calls. The base case here is of size 1; in practice you want to coarsen the base case to overcome function-call overheads. Let's also look at what these highlighted values correspond to — I've color-coded them so they correspond to particular elements of the submatrix I'm looking at on the right. These values correspond to the index of the first element in each of the quadrants. The first element of the first quadrant just has offset zero. The first element of quadrant 12 is on the same row as the first element of quadrant 11, so I just add the width of the quadrant, which is n/2. To get the first element of quadrant 21, I jump over n/2 rows, each of length row size, so that's n/2 times row size. And the first element of quadrant 22 is the first element of quadrant 21 plus n/2, so that's n/2 times row size plus n/2.
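Here's a compact sketch of that recursion in C, in the spirit of the code on the slide (the slide writes out the eight calls explicitly; this sketch enumerates them with small loops). rs is the row size of the full underlying matrix, and the offsets {0, n/2, (n/2)·rs, (n/2)·rs + n/2} pick out the four quadrants. The base case is size 1 for clarity — a real implementation would coarsen it:

```c
// Recursive multiply-add C += A * B on n x n submatrices of a matrix
// whose full row length is rs.
void mm_rec(double *C, const double *A, const double *B, int n, int rs) {
    if (n == 1) {
        C[0] += A[0] * B[0];   /* coarsen this base case in practice */
        return;
    }
    int h = n / 2;
    int off[4] = {0, h, h * rs, h * rs + h};   /* quadrant offsets */
    for (int i = 0; i < 2; i++)          /* quadrant row of C and A  */
        for (int j = 0; j < 2; j++)      /* quadrant column of C, B  */
            for (int k = 0; k < 2; k++)  /* eight recursive calls    */
                mm_rec(C + off[2*i + j],
                       A + off[2*i + k],
                       B + off[2*k + j], h, rs);
}
```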
So let's analyze the work of this algorithm. What's the recurrence for the work? How many subproblems do we have? Eight. What's the size of each subproblem? n over 2. And how much work do we do to set up the recursive calls? A constant amount. So the recurrence is W(n) = 8W(n/2) + Θ(1). What does that solve to? n cubed — it's one of the three cases of the master theorem. We're actually going to analyze this in more detail by drawing out the recursion tree, which gives more intuition about why the master theorem is true. At the top level of the recursion tree I have a problem of size n, and it branches into 8 subproblems of size n/2; I do a constant amount of work to set up the recursive calls, which I'll just label 1 — I'm ignoring the constants, but that doesn't matter for asymptotic analysis. Then I branch again into 8 subproblems of size n/4, and eventually I get down to the leaves. How many levels until the leaves? Log n — what's the base of the log? It's log base 2 of n, because I'm dividing the problem size by two every time. Therefore the number of leaves is 8 to the log base 2 of n, because I branch 8 ways every time, and 8 to the log base 2 of n is the same as n to the log base 2 of 8, which is n cubed. The amount of work at the top level is constant, so I'll just say 1; the next level is 8 times that, then 64, and when I get to the leaves it's Θ(n³), since I have n³ leaves all doing constant work. The work is geometrically increasing as I go down the recursion tree, so the overall work is dominated by the work at the leaves: Θ(n³), the same as the looping versions of matrix multiply — they're all cubic work. Now let's analyze the number of cache misses of this divide-and-conquer algorithm. My recurrence is going to be different now: the base case is when the submatrix fits in cache, that is, when n² < cM. When that's true, I just load the submatrix into cache and incur no more cache misses, so I need Θ(n²/B) misses when n² < cM, for some sufficiently small constant c ≤ 1 — the Θ(n²/B) term comes from the submatrix caching lemma, which says I can load the entire submatrix into cache with that many misses. Otherwise, I recurse into 8 subproblems of size n/2 and add Θ(1), since I do a constant amount of work to set up the recursive calls. So the difference between this cache analysis and the work analysis is the base case: in all the analyses you've seen before, the base case was of constant size, but here we're working with a base case that is not of constant size. Let's analyze this using the recursion tree approach. At the top level I have a problem of size n; it branches into eight problems of size n/2, and I incur a constant number of cache misses, say 1. I branch again, and eventually I get to the base case, where n² < cM, and there the number of cache misses I incur is Θ(cM/B) — I can just plug in cM for n². The number of levels of recursion in this tree is no longer just log base 2 of n: it's log base 2 of n minus log base 2 of the square root of cM, which is the same as log base 2 of n minus one-half log base 2 of cM. The number of leaves is 8 to that number of levels — 8 to the (log base 2 of n minus one-half log base 2 of cM) — which is Θ(n³ / M^(3/2)): the n³ comes from the 8 to the log base 2 of n term, and 8 to the negative one-half log base 2 of cM gives the M^(3/2) in the denominator. Any questions on how I computed the number of levels of this recursion tree? I'm basically dividing my problem size by two until I get to a problem size that fits into cache — that means n less than the square root of cM — so I can subtract that many levels from the recursion tree; then, since I'm branching eight ways, I take 8 to the power of the number of levels, and that gives me the total number of leaves.
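Collecting this into one place — the cache-miss recurrence and its solution, under the tall-cache assumption:

```latex
\[
Q(n) =
\begin{cases}
\Theta(n^2/B) & \text{if } n^2 < cM \quad\text{(submatrix caching lemma)},\\
8\,Q(n/2) + \Theta(1) & \text{otherwise,}
\end{cases}
\]
\[
\#\text{leaves} = 8^{\log_2 n - \frac{1}{2}\log_2 cM}
              = \Theta\!\big(n^3 / M^{3/2}\big),
\qquad
Q(n) = \Theta\!\Big(\frac{n^3}{M^{3/2}}\Big)\cdot
       \Theta\!\Big(\frac{cM}{B}\Big)
     = \Theta\!\Big(\frac{n^3}{B\sqrt{M}}\Big).
\]
```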
Now let's count the number of cache misses at each level of the recursion tree. At the top level I have a constant number of misses, say 1; the next level 8, then 64; and at the leaves I have Θ(n³ / (B√M)) cache misses. I got that quantity by multiplying the number of leaves by the number of misses per leaf: the number of leaves is n³/M^(3/2), and the misses per leaf are Θ(cM/B), so one factor of M cancels, leaving √M at the bottom, and I also divide by the block size B. Again this is a geometric series, and the number of cache misses at the leaves dominates all the other levels, so the total number of cache misses is Θ(n³ / (B√M)). Notice that I'm getting the same number of cache misses as I did with the tiled version of the code, but here I never tuned my code for a particular cache size. So which cache sizes does this code work for — is it going to get good cache performance on your machine? This code works for all cache sizes, because I didn't tune it for any particular one. This is what's known as a cache-oblivious algorithm: it has no voodoo tuning parameters, it has no explicit knowledge of the caches, and it's essentially passively auto-tuning itself for the particular cache size of your machine. It can also work for multi-level caches automatically, because I never specified which level of cache I was analyzing — I can analyze it for any level and it still gives good cache complexity. And it's good in multiprogrammed environments, where you might have other jobs running and you don't know your effective cache size — it just passively auto-tunes for whatever cache is available. It turns out that the best cache-oblivious codes today work on arbitrary rectangular matrices — I've only talked about square matrices, but the best codes work on rectangular matrices — and they perform binary splitting instead of eight-way splitting: you split on the largest of i, j, and k. That's what the best cache-oblivious matrix multiplication algorithm does. Any questions? OK, so far I've only talked about the serial setting — I was assuming these algorithms run on just a single thread. What happens if I go to multiple processors? It turns out the results generalize to a parallel context. This is the recursive parallel matrix-multiply code we saw before: notice that we execute four sub-calls in parallel, do a sync, and then do four more sub-calls in parallel.
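Here's a sketch of that structure in Cilk-style C (using the cilk_spawn and cilk_sync keywords, so it needs a Cilk-enabled compiler such as OpenCilk); the quadrant offsets are as in the serial sketch above. The first four products can run in parallel because they write to distinct quadrants of C; the sync is needed before the second four, which accumulate into the same quadrants:

```c
#include <cilk/cilk.h>

// Parallel recursive multiply-add C += A * B on n x n submatrices of
// a matrix with full row length rs.
void mm_par(double *C, const double *A, const double *B, int n, int rs) {
    if (n == 1) { C[0] += A[0] * B[0]; return; }
    int h = n / 2;
    int o[4] = {0, h, h * rs, h * rs + h};      /* quadrant offsets */
    /* round 1: C11+=A11*B11, C12+=A11*B12, C21+=A21*B11, C22+=A21*B12 */
    cilk_spawn mm_par(C + o[0], A + o[0], B + o[0], h, rs);
    cilk_spawn mm_par(C + o[1], A + o[0], B + o[1], h, rs);
    cilk_spawn mm_par(C + o[2], A + o[2], B + o[0], h, rs);
               mm_par(C + o[3], A + o[2], B + o[1], h, rs);
    cilk_sync;   /* round 2 writes the same quadrants of C */
    /* round 2: C11+=A12*B21, C12+=A12*B22, C21+=A22*B21, C22+=A22*B22 */
    cilk_spawn mm_par(C + o[0], A + o[1], B + o[2], h, rs);
    cilk_spawn mm_par(C + o[1], A + o[1], B + o[3], h, rs);
    cilk_spawn mm_par(C + o[2], A + o[3], B + o[2], h, rs);
               mm_par(C + o[3], A + o[3], B + o[3], h, rs);
    cilk_sync;
}
```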
To analyze the number of cache misses in this parallel code, we use the following theorem. Let Q_P be the number of cache misses in a deterministic Cilk computation when run on P processors, each with a private cache of size M, and let S_P be the number of successful steals during the computation. In the ideal-cache model, Q_P = Q_1 + O(S_P · M/B): the number of cache misses in the parallel execution equals the number of misses in the serial execution plus a term of order the number of steals times M over B. The proof goes as follows. Recall that in the Cilk runtime system, workers steal tasks from other workers when they have no work to do. After a worker steals a task, its cache is completely cold in the worst case, because it wasn't working on that subproblem before; but after M/B cold cache misses, its cache becomes identical to what it would be in the serial execution — we just pay M/B misses to make the cache look the same as if we were executing serially. The same is true when a worker resumes a stolen computation after a cilk_sync. These two situations can happen at most 2 · S_P times — twice the number of steals — and each time we pay M/B misses; that's where the additive O(S_P · M/B) term comes from. We also know that the number of steals in a Cilk program is upper-bounded by O(P · T_infinity) in expectation, where P is the number of processors and T_infinity is the span of the computation, so if you minimize the span of your computation, this also gives you good cache bounds. The moral of the story: minimizing the number of cache misses in the serial execution essentially minimizes them in the parallel execution, for a low-span algorithm. For this recursive matrix multiplication algorithm, the span recurrence is T_infinity(n) = 2 T_infinity(n/2) + Θ(1) — since we do a sync, we pay the critical path length of two sub-calls — and that solves to Θ(n). Applying the theorem gives a cache-miss bound of Θ(n³/(B√M)), the same as the serial execution, plus an additive term of order P · n · M/B — P times the span times M over B. So that was a parallel algorithm for matrix multiply, and we saw that we can also get good cache bounds there. Here's a summary of what we talked about today: we talked about associativity in caches and different ways you can design a cache; we talked about the ideal-cache model, which is useful for analyzing algorithms; we talked about cache-aware algorithms that have explicit knowledge of the cache, with tiled matrix multiply as the example; and then we came up with a much simpler algorithm that was cache-oblivious, using divide and conquer. In Thursday's lecture we'll see much more on cache-oblivious algorithm design, and you'll have an opportunity to analyze the cache efficiency of some algorithms in the next homework.
lot of verification involved in trying to prove that cache coherence protocols are correct. Any questions so far?

OK, so let's talk about the associativity of a cache. Here I'm showing a fully associative cache: a cache block can reside anywhere in the cache. The basic unit of movement is a cache block; in this example the cache block size is 4 bytes, but on the machines we're using for this class the cache block size is 64 bytes. Each row here corresponds to one cache line, and fully associative means that each line can go anywhere in the cache. We're also assuming a cache of size 32 bytes, so in total it can store 8 cache lines, since each cache line is 4 bytes.

To find a block in a fully associative cache, you have to search the entire cache, because a cache line can appear anywhere. There's a tag associated with each cache line that specifies which address in the virtual memory space it corresponds to. For a fully associative cache, we use most of the bits of the address as the tag. We don't actually need the two lower-order bits, because data moves at the granularity of cache lines, which are 4 bytes here, so the two lower-order bits are always the same; the rest of the bits store the tag. If our address space is 64 bits, we use 62 bits for the tag in a fully associative scheme.
When the cache becomes full, a block has to be evicted to make room for a new one, and there are various ways to decide which block to evict; this is known as the replacement policy. One common replacement policy is LRU, least recently used: you kick out the block that was used furthest in the past. There are other schemes, such as second chance and clock replacement; we won't talk much about the different replacement schemes today, but you can read about them online.

So what's a disadvantage of this scheme? Right: searching for a cache line can be pretty slow, because in the worst case you have to search the entire cache, since a cache block can reside anywhere. Even though the search can go on in parallel in hardware, it's still expensive in terms of power and performance to search most of the cache on every access.

Let's look at the other extreme: a direct-mapped cache. In a direct-mapped cache, each cache block can go in only one place in the cache. I've color-coded the blocks here: the red blocks can only go in the first row of the cache, the orange ones only in the second row, and so on. The position a cache block can go to is known as that block's set; the set determines the location in the cache for each particular block.

Let's look at how the virtual memory address is divided up, and which bits determine where a cache block goes. We have the offset, set, and tag fields. The offset tells us which position we want within a cache block: since a cache block has B bytes, we need log2(B) offset bits. We need the offset because we're not always accessing the beginning of a cache block; we might want something in the middle. The set field determines which position in the cache the block can go to: there are eight possible positions for each cache block here, so we need log2(8) = 3 set bits in this example; more generally it's log2(M/B) bits, and here M/B is 8. Finally, the remaining bits form the tag, w minus log2(M) bits, which is stored along with the cache block and uniquely identifies which memory block in virtual memory the cache block corresponds to. You can verify that these three quantities sum to w, the total number of address bits.
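To make the field arithmetic concrete, here's a small C sketch (mine, not from the lecture) that extracts the three fields for a hypothetical direct-mapped cache with M = 32 KB and B = 64 bytes: 6 offset bits, log2(M/B) = 9 set bits, and 64 - 15 = 49 tag bits.

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical direct-mapped parameters: M = 32 KB = 2^15, B = 64 = 2^6,
// 64-bit addresses, so offset = log2(B) = 6 bits, set = log2(M/B) = 9 bits,
// and the tag is the remaining 64 - log2(M) = 49 bits.
enum { OFFSET_BITS = 6, SET_BITS = 9 };

int main(void) {
    uint64_t addr = 0x7f12a4567ab3;  // made-up address, purely for illustration
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);
    printf("offset=%llu set=%llu tag=%llu\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```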
So what's the advantage and disadvantage of this scheme compared to the previous one? First, the good part: lookups are fast, because each cache block can go in only one place, so that's the only place you have to check. If the block is there you find it, and if not, you know it's not in the cache. The downside is that you might, for example, keep accessing only the red cache blocks: they all map to the same location in the cache, so they keep evicting each other even though there's plenty of empty space elsewhere. This is known as a conflict miss, and conflict misses can be very bad for performance and very hard to debug. So in a direct-mapped cache, finding a block is very fast, since only a single location has to be searched, but you can suffer conflict misses if you repeatedly access things in the same set without touching the other sets. Any questions?

OK, so those are the two extremes of cache design. There's a hybrid solution called a set-associative cache. You still have sets, but each set now contains more than one line. All the red blocks still map to the red set, but there are now two possible locations for them: this is a 2-way associative cache, since there are two possible locations within each set. In general, a cache block's set determines k possible cache locations for that block: within a set it's fully associative, but each block can go in only one set.

Let's look again at how the address bits divide up. We still have tag, set, and offset fields. The offset field is still log2(B) bits. The set field takes log2(M/(kB)) bits, since the number of sets is M/(kB). The remaining bits form the tag: w minus log2(M/k) bits. To find a block in the cache, only the k locations of its set must be searched: you find which set the block maps to, then check all k locations within that set. And whenever you bring something into a full cache, you evict something from the same set as the incoming block. I showed a 2-way associative cache here, but in practice the associativity is usually bigger, say 8-way, 16-way, or sometimes 20-way. As you keep increasing the associativity, it looks more and more like a fully associative cache; and a 1-way associative cache is just a direct-mapped cache. So set associativity is a hybrid between a direct-mapped cache and a fully associative cache. Any questions on set-associative caches?
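As a sketch of why a lookup touches only k locations, here's some illustrative C (hypothetical parameters and names; real hardware does this with parallel comparators, not a loop): compute the set from the address, then compare the tag against just the k ways of that set.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical parameters: M = 32 KB cache, B = 64-byte lines, K = 8 ways,
// so there are M / (K * B) = 64 sets.
#define CACHE_M (32 * 1024)
#define LINE_B  64
#define WAYS_K  8
#define NSETS   (CACHE_M / (WAYS_K * LINE_B))

struct line { bool valid; uint64_t tag; };
static struct line cache[NSETS][WAYS_K];

// Returns true on a hit; only the K lines of one set are searched.
bool cache_hit(uint64_t addr) {
    uint64_t set = (addr / LINE_B) % NSETS;                // log2(M/(K*B)) set bits
    uint64_t tag = addr / ((uint64_t)LINE_B * NSETS);      // w - log2(M/K) tag bits
    for (int way = 0; way < WAYS_K; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;
    return false;
}
```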
OK, let's go over a taxonomy of the different types of cache misses you can incur.

The first type is a cold miss: the miss you necessarily incur the first time you access a cache block. If you need that piece of data, there's no way around it; the cache starts out without the block, and the first access has to bring it in.

Then there are capacity misses: misses you get because the cache is full and can't fit all the cache blocks you want to access. You get a capacity miss when the previously cached copy would have been evicted even with a fully associative scheme; that is, even if every location in the cache were available to that cache line, it would still have to be evicted because there isn't enough space. You can deal with capacity misses by introducing more locality into your code, both spatial and temporal, and we'll look at ways to reduce the capacity misses of algorithms later in this lecture.

Then there are conflict misses. Conflict misses happen in set-associative caches when too many blocks map to the same set, and some of them have to be evicted because the set can't fit all of them, even though these blocks would not have been evicted under a fully associative scheme. For example, if a set holds 16 lines and you keep cycling through 17 blocks that all belong to that set, something gets kicked out on every access, and those evictions might not have happened with a fully associative cache.

Finally, there are sharing misses, which only happen in a parallel context; we talked about true and false sharing misses in prior lectures, so let's review briefly. Sharing misses can happen when multiple processors access the same cache line and at least one of them is writing to it. If all the processors are just reading, the cache coherence protocol lets them all hold the line at the same time without misses; but if at least one processor is modifying it, you can get either true or false sharing misses. A true sharing miss occurs when two processors access the same data on the same cache line: as you recall from a previous lecture, when one processor writes, it must acquire the line in exclusive mode and invalidate that line in all other caches, so when another processor then accesses the same memory location, it has to bring the line back into its own cache, incurring a miss. A false sharing miss occurs when two processors access different data that just happen to reside on the same cache line; since the basic unit of movement is a cache line, you still pay sharing misses even though you're accessing different things.

False sharing is pretty hard to deal with, because in general you don't know what data gets placed on which cache line. There are heuristics: for example, if you malloc a big memory region, you know that region is contiguous, so you can space the accesses of different processors far enough apart that they never touch the same cache line, as sketched below. But if you're declaring local variables on the stack, you don't know where the compiler will place them in the virtual memory address space.
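Here's a sketch in C with pthreads of that spacing idea (the struct, thread count, and iteration count are all made up for illustration): each thread's counter is padded and aligned to occupy its own 64-byte line, so concurrent writers never share a line.

```c
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4

// Pad (and align) each counter to a full 64-byte line, so two threads
// incrementing different counters never write to the same cache line.
struct padded {
    _Alignas(64) volatile long count;
    char pad[64 - sizeof(long)];
};
static struct padded counters[NTHREADS];

static void *work(void *arg) {
    long id = (long)(intptr_t)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].count++;  // each thread touches only its own line
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)(intptr_t)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```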
So those are the four types of cache misses you should know about. There are many models out there for analyzing the cache performance of algorithms, and some of them ignore some of these types of misses; be aware of this when you look at algorithm analyses, because not every model captures every kind of miss.

Now let's look at a bad case for conflict misses. Suppose I want to access a submatrix within a larger matrix, and recall that matrices are stored in row-major order. Say the matrix is 4096 columns by 4096 rows and stores doubles. Each row then contains 2^15 bytes, since 4096 is 2^12 and each double takes 2^3 = 8 bytes. We'll assume the word width is 64 bits, the cache size is 32 KB, and the cache block size is 64 bytes, all standard, and say we have a 4-way associative cache.

Let's see how the address bits divide up. How many offset bits? Six: log2(64). How many set bits? Seven: M = 32 KB = 2^15, k = 2^2, and B = 2^6, so the number of sets is M/(kB) = 2^15 / 2^8 = 2^7, and log2 of that is 7. And the tag? 51 bits: 64 minus 7 minus 6.

Now suppose we want to access a 32 x 32 submatrix A within this larger matrix. This is pretty common in matrix algorithms, especially divide-and-conquer ones. Say we want to access a column of A. The addresses of the elements we access are as follows: if the first element of the column is at address x, the second is at x + 2^15, since each row has 2^15 bytes and we skip over an entire row to reach the element in the next row of the submatrix; the third is at x + 2 * 2^15; and so on, up to x + 31 * 2^15.

Which fields of the address change as we go down the column? Just the tag. The set and offset stay the same, because they live in the lower 13 bits, and incrementing by multiples of 2^15 never touches those bits. So all of these addresses fall into the same set, and that's a problem, because our cache is only 4-way associative: each set holds only four cache lines, and we're accessing 32 of them. By the time we get to the next column of A, everything we accessed in the current column has already been evicted. This is a conflict miss: with a fully associative cache it might not have happened, since any location in the cache could have held these blocks. Any questions on why we get conflict misses here?
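You can check this claim with a few lines of C (the base address x here is arbitrary; only the 2^15 stride matters): the set field, bits 6 through 12, comes out the same for every element of the column.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 0x100000;  // arbitrary base address for the column
    for (int i = 0; i < 32; i++) {
        uint64_t addr = x + (uint64_t)i * (1ULL << 15);  // next row, same column
        uint64_t set  = (addr >> 6) & 0x7F;              // bits 6..12: set field
        printf("row %2d -> set %llu\n", i, (unsigned long long)set);
    }
    return 0;  // every row prints the same set index
}
```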
Does anybody have ideas on how to fix this, so that we're not incrementing by exactly 2^15 every time? One solution is to pad the matrix: add a small constant amount of space, say 17 bytes, to the end of each row, so that each row is longer than 2^15 bytes. Now when you walk down a column of the submatrix you're not incrementing by exactly 2^15; you're also adding that small constant, which causes the set and offset fields to change as well, and you don't get as many conflict misses.

It turns out that if you're doing matrix multiplication, a cubic-work algorithm, you can instead afford to copy the submatrix into a temporary 32 x 32 matrix, do all the operations on the temporary, and then copy the result back into the original matrix. Across the whole algorithm the copying takes only quadratic work, which is a lower-order term relative to the cubic work, so you can use temporary space to make sure you don't get conflict misses. Any questions?
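Here's a sketch of the copy-to-temporary fix (assuming, as in the example above, a 4096-wide row-major matrix of doubles and a 32 x 32 submatrix; the function names are mine):

```c
#include <string.h>

#define N 4096  // full matrix is N x N doubles, row-major
#define S 32    // submatrix side length

// Gather the S x S submatrix with top-left corner (i0, j0) into contiguous
// temporary storage; column accesses within tmp now stride by S doubles
// instead of 2^15 bytes, so they no longer all map to one set.
void copy_in(const double *A, double tmp[S][S], int i0, int j0) {
    for (int i = 0; i < S; i++)
        memcpy(tmp[i], &A[(size_t)(i0 + i) * N + j0], S * sizeof(double));
}

// Scatter the temporary back after operating on it.
void copy_out(double *A, double tmp[S][S], int i0, int j0) {
    for (int i = 0; i < S; i++)
        memcpy(&A[(size_t)(i0 + i) * N + j0], tmp[i], S * sizeof(double));
}
```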
OK, so that was conflict misses. Conflict misses matter, but usually we worry first about getting good spatial and temporal locality, since those are the higher-order factors in a program's performance; once we have them, we can start addressing conflict misses, for example with temporary storage or by padding our data by small constants.

Now I want to talk about a model we can use to analyze the cache performance of algorithms: the ideal cache model. In this model we have a two-level hierarchy, a cache and main memory. The cache has size M and cache lines of B bytes, so it holds M/B cache lines. The model assumes the cache is fully associative, and it assumes an optimal, omniscient replacement policy: when we evict a cache block, we pick the one whose eviction gives the lowest total number of cache misses over the entire execution, as if we knew the whole sequence of memory requests in advance, which is why it's called omniscient. If something is in cache you can operate on it for free; if it's in main memory, you have to bring it into cache, incurring a cache miss.

There are two performance measures we care about. First, the work, which is just the ordinary running time of the program: the total number of operations it does, the same as in our earlier algorithm analyses. Second, the cache complexity: the number of cache misses, which counts just the transfers between main memory and the cache. Ideally we'd like algorithms with a low number of cache misses without increasing the work over the standard algorithm. Sometimes we can do that; sometimes we can't, and there's a trade-off between the work and the number of cache misses that you have to weigh as a performance engineer. Today we're going to look at an algorithm where you can actually reduce the number of cache misses without increasing the work, so you get the best of both worlds. Any questions on the ideal cache model? It's used only for analyzing algorithms; you can't buy one of these caches at the store. Ideal caches don't exist.

But the optimal omniscient replacement policy has nice theoretical properties. Here is a very important lemma, proved in 1985 by Sleator and Tarjan, called the LRU lemma: suppose an algorithm incurs Q cache misses on an ideal cache of size M; then on a fully associative cache of size 2M that uses the LRU (least recently used) replacement policy, it incurs at most 2Q cache misses. In other words, if I can bound the cache misses of an algorithm on the ideal cache, then on a fully associative cache of twice the size running LRU, a quite practical policy, the algorithm incurs at most twice as many misses. The implication is that for asymptotic analysis you can assume either the optimal or the LRU replacement policy, whichever is convenient, because the miss counts are within a constant factor of each other. This lemma makes it much easier to analyze the cache misses of algorithms.

Here's a software engineering principle I want to point out. When you're going for performance, first come up with a theoretically good algorithm, with good bounds on the work and the cache complexity; only then start engineering for detailed performance, worrying about details such as real-world caches not being fully associative, or loads and stores having different costs with respect to bandwidth and latency. Coming up with a theoretically good algorithm is the first-order bit of getting good performance. Questions?

OK, let's start analyzing the number of cache misses in programs. Here's a lemma, the cache miss lemma. Suppose a program reads a set of r data segments, where the i-th segment consists of s_i bytes, and suppose the sum of the segment sizes is N, with N < M/3, less than a third of the cache size. Also assume N/r >= B. What does N/r mean semantically? It's the average size of a segment, so we're assuming the average segment is at least as large as a cache line. If these two assumptions hold, then all the segments fit into cache, and the number of cache misses to read them all is at most 3N/B. For comparison, if you had a single array of N bytes, reading it into cache would take about N/B misses; the lemma says that even with the data divided into segments, as long as the average segment length is large enough, the number of cache misses is only a constant factor worse than reading a single contiguous array.
Let's try to prove the cache miss lemma. A single segment s_i incurs at most s_i/B + 2 cache misses. Can anyone tell me where that comes from? Consider how the segment is aligned in virtual memory: s_i/B + 2 is the number of blocks that can overlap it in the worst case. You need s_i/B misses just to load the s_i bytes, but the beginning and end of the segment might not align with cache line boundaries, so you can waste at most one block on each side; that's where the plus two comes from.

To get the total number of cache misses, sum this quantity from i = 1 to r. Summing s_i/B gives N/B by definition, and summing 2 gives 2r. Now multiply the second term's top and bottom by B to get 2rB/B, which is at most 2N/B. Why do we know 2rB <= 2N? Because N >= rB by the assumption that the average segment size N/r is at least B. Then N/B + 2N/B sums to 3N/B, so in the worst case we incur 3N/B cache misses.
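Restating the computation in symbols, using r B <= N:

```latex
\sum_{i=1}^{r}\left(\frac{s_i}{B} + 2\right)
  = \frac{N}{B} + 2r
  = \frac{N}{B} + \frac{2rB}{B}
  \le \frac{N}{B} + \frac{2N}{B}
  = \frac{3N}{B}.
```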
Any questions on the cache miss lemma? The important thing to remember: if you have a whole bunch of data segments whose average length is large enough, bigger than a cache block, then you can access all of them almost as if they were a single array; it increases the number of cache misses by only a constant factor, which doesn't matter in an asymptotic analysis. We'll use this lemma later when we analyze algorithms.

OK, another assumption we'll need is the tall cache assumption, which says the cache is taller than it is wide: B^2 <= cM for some sufficiently small constant c <= 1. In other words, the number of cache lines, M/B, is larger than B. The tall cache assumption is usually satisfied in practice. On our machines the cache line size is 64 bytes and the L1 cache size is 32 KB: 64^2 = 2^12 bytes, and 32 KB = 2^15 bytes, so 2^12 < 2^15 and the assumption holds. As we go up the memory hierarchy, the cache sizes increase while the line length stays the same, so the caches only become taller.

To see why this assumption is useful, consider what's wrong with a short cache. In a short cache the lines are very wide, wider than the number of lines in the cache. Say we're working with an n x n submatrix stored in row-major order. With a short cache, even if n^2 < cM, meaning all the bytes of the submatrix would fit, you still might not be able to fit it: the submatrix has n rows, but only M/B of the rows fit in the cache, because the lines are so long, and we'd be wasting most of each line, using only a small fraction of it to store one row of the submatrix. If this were the entire matrix it would actually be OK, since consecutive rows are placed contiguously in memory; but for a submatrix we can't guarantee the next row comes right after the current one, and we often have to deal with submatrices in recursive matrix algorithms. That's what's wrong with short caches, and why we assume the tall cache assumption, which we may, since it's usually satisfied in practice. (The TLB actually tends to be short, with only a few entries, so it might not satisfy the assumption, but all the other caches will.) Any questions?

Here's another useful lemma, the submatrix caching lemma. Suppose an n x n matrix A is read into a tall cache satisfying B^2 <= cM for some constant c <= 1, and suppose cM <= n^2 < M/3. Then A fits into cache, and the number of cache misses required to read all of A's elements is at most 3n^2/B.

Let's see why this is true. Let N denote the total number of bytes to access, so N = n^2. We'll apply the cache miss lemma, which says that if the average segment length is large enough, we can read all the segments as though they formed one contiguous array. Here the segments are the rows, so r = n and each segment has length n, since we're working with a square submatrix. We need B <= n = N/r. Where does that come from? From the assumptions: B^2 <= cM and cM <= n^2, so B^2 <= n^2 and B <= n. We also have N = n^2 < M/3 by assumption, so the cache miss lemma applies and tells us we need only 3n^2/B cache misses to read the whole thing in. Any questions on the submatrix caching lemma?

OK, now let's analyze matrix multiplication. Here's the code for the standard cubic-work algorithm: two input matrices A and B, with the result stored in C; the matrices are n x n. We'll deal with square matrices here, but everything extends to non-square ones. We loop over i from 0 to n-1, j from 0 to n-1, and k from 0 to n-1, and increment C[i*n + j] by A[i*n + k] * B[k*n + j].
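Written out in C, the loop nest just described looks like this (a sketch; the exact declarations on the slide may differ):

```c
#include <stdint.h>

// Standard cubic-work matrix multiply: C += A * B for square n x n
// matrices stored in row-major order.
void matmul(double *restrict C, const double *restrict A,
            const double *restrict B, int64_t n) {
    for (int64_t i = 0; i < n; i++)
        for (int64_t j = 0; j < n; j++)
            for (int64_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```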
What's the work of this algorithm? This should be review for all of you: Theta(n^3).

Now let's analyze the number of cache misses this algorithm incurs. Again, assume the matrices are in row-major order and the tall cache assumption holds. We'll analyze the misses on matrix B, because B's misses turn out to dominate the total. There are three cases to consider, and throughout we'll assume LRU: recall that by the LRU lemma, whatever we count under LRU is within a constant factor of the count on the ideal cache.

Case 1: n > cM/B for some constant c. To do the multiplication, we go through one row of A and one column of B and compute the dot product; that's what happens in the innermost loop. How many misses do we incur going down one column of B? In this case we can't fit even one block from each row of B in the cache. B has n rows, and each row costs one miss, so in total the first column costs n misses. What about the second column? Still n: since one block per row doesn't fit, by the time we get back to the top of B, the top block has already been evicted and must be reloaded, and the same holds for every other block we access. So every column costs n misses, and we do this again for every row of A: Theta(n^3) misses in total, one per entry of B that we access. This is not very good: the total work is also Theta(n^3), so we're gaining nothing from locality.

Case 2: c'sqrt(M) < n < cM/B. Now we can fit one block from each row of B into cache, but not the whole matrix. The first column of B costs n misses, as before. What about the second column: by the time we return to the top of the matrix, is the top block still in cache? It turns out it is, because we can fit one block from every row of B; the block we loaded while walking the first column is still there when we reach it in the second column. So we incur no misses on the second column.
In general, if we can fit some constant times B columns' worth of accesses in cache, we reduce the number of misses by a factor of B: we pay a miss only the first time we access a block, not on any subsequent access. The same is true for each row of A, and since A has n rows, each incurring Theta(n^2/B) misses over matrix B, the overall count is Theta(n^3/B). The point is that within B we can now exploit spatial locality: once a block is loaded, we reuse it the next time we traverse a nearby column. Any questions on this analysis?

Case 3: n < c'sqrt(M), so the entire matrix fits into cache. How many misses now? For the first row of A the analysis is the same as before: n^2/B misses to load B in. For the second row of A, though, we incur no misses at all: once B is loaded it stays in cache, because the whole matrix fits. So the total number of misses for matrix B is Theta(n^2/B): everything fits, and we just apply the submatrix caching lemma from before.

Overall, this is not a very good algorithm, because in case 1 we needed a cubic number of cache misses. What happens if we swap the order of the inner two loops? Recall that this was one of the optimizations in lecture 1, when Charles talked about speeding up matrix multiplication. If we interchange the two inner loops, then each innermost iteration walks along a row of C and a row of B, while the element of A stays fixed. Now the analysis of matrix B under LRU benefits from spatial locality, since we go row by row and the matrix is stored in row-major order: covering all the rows takes Theta(n^2/B) misses, and we do that n times for the outer loop, giving Theta(n^3/B) misses in total. Swapping the order of the inner two loops significantly improves the locality of the algorithm, and that's why we saw such a significant performance improvement in the first lecture. Any questions?
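For reference, the interchange just swaps the two inner loop headers of the earlier sketch:

```c
#include <stdint.h>

// With the j and k loops swapped, the innermost loop walks along a row
// of C and a row of B while A[i*n + k] stays fixed, so both scans get
// spatial locality in row-major order.
void matmul_interchanged(double *restrict C, const double *restrict A,
                         const double *restrict B, int64_t n) {
    for (int64_t i = 0; i < n; i++)
        for (int64_t k = 0; k < n; k++)
            for (int64_t j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```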
Does anybody think we can do better than Theta(n^3/B) cache misses, or is that the best possible? It turns out you can do better, using an optimization called tiling. Instead of three for loops we'll have six, looping over s x s tiles: the three inner loops iterate within a submatrix, doing all the computation for that tile before moving on to the next, and the three outer loops iterate over the larger matrix, one submatrix at a time.

Let's analyze the work. Multiplying one s x s tile pair costs Theta(s^3), the usual bound for matrix multiplication, and the number of tile operations is (n/s)^3; you can see this by treating each tile as a single element and applying the same cubic-work analysis to the smaller matrix of tiles. So the work is (n/s)^3 * s^3 = Theta(n^3): tiled matrix multiply does the same work as the version without tiling.

Now the cache misses. We tune s so the submatrices just fit into cache: s = Theta(sqrt(M)). Really it needs to be about one third of sqrt(M), because we have to fit three submatrices in cache at once, but it's some constant times sqrt(M). The submatrix caching lemma implies each tile costs Theta(s^2/B) misses to load; once loaded, it fits entirely in cache, so all the computation on it incurs no further misses. Therefore the total number of misses is the number of subproblems, (n/s)^3, times the misses per subproblem, s^2/B, which with s = Theta(sqrt(M)) multiplies out to Theta(n^3 / (B sqrt(M))).

This is a pretty cool result: you can beat the n^3/B bound by a factor of sqrt(M), and in practice sqrt(M) is not insignificant. For the last-level cache, M is on the order of megabytes, so sqrt(M) is on the order of thousands; tuning s so the submatrices just fit in cache significantly improves the performance of the matrix multiplication code. It turns out this bound is optimal for cubic-work matrix multiplication; that was shown in 1981. (If you use another algorithm, such as Strassen's, you can do better.) I want you to remember this bound: it says matrix multiplication can benefit from both spatial locality, the B in the denominator, and temporal locality, the sqrt(M) term, which comes from doing all the work inside a submatrix before evicting it from cache. Any questions on this analysis?
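Here's a sketch of the six-loop tiled version (assuming for simplicity that s divides n; s is the tuning parameter discussed next):

```c
#include <stdint.h>

// Tiled matrix multiply: the three outer loops walk over s x s tiles,
// the three inner loops do all the work for one tile triple before
// moving on. Tune s so that three tiles fit in cache: s = Theta(sqrt(M)).
void matmul_tiled(double *restrict C, const double *restrict A,
                  const double *restrict B, int64_t n, int64_t s) {
    for (int64_t ih = 0; ih < n; ih += s)
        for (int64_t jh = 0; jh < n; jh += s)
            for (int64_t kh = 0; kh < n; kh += s)
                for (int64_t il = 0; il < s; il++)
                    for (int64_t kl = 0; kl < s; kl++)
                        for (int64_t jl = 0; jl < s; jl++)
                            C[(ih+il)*n + jh+jl] +=
                                A[(ih+il)*n + kh+kl] * B[(kh+kl)*n + jh+jl];
}
```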
So what's one issue with this algorithm? Right: I've tuned s for my particular machine. I call s a voodoo parameter: a magic number I put into the program so that a tile fits in the cache of the specific machine I'm running on. That makes the code non-portable: if I run it on another machine with different cache sizes, I won't get the same performance. It's an issue even on the same machine, because other programs may be running at the same time and using part of the cache, so you don't actually know how much of the cache your program gets to use in a multiprogrammed environment.

And this was just for one level of cache. To optimize for two levels of cache, we need two voodoo parameters, s and t: submatrices and sub-submatrices, both of which must be tuned for the best performance on our machine. Multidimensional tuning can't be done simply with binary search (with one level of cache you can binary-search on s), so optimizing becomes much more expensive, and the code gets messier: nine nested for loops instead of six. How many levels of cache do our machines have? Three. So for a three-level cache you have three voodoo parameters and twelve nested for loops. The code becomes very ugly, the parameters must be tuned for your particular machine, the result isn't portable, as one student pointed out, and in a multiprogrammed environment you don't know the effective cache size your program has access to, so it's very easy to mistune the parameters.

Questions? One student asked about auto-tuning: yes, you can auto-tune the program for the cache sizes of your particular machine, and you can look those sizes up, for example in /proc/cpuinfo. But even then, when other jobs are running at the same time, you don't actually know how much cache your program has access to. Another question was whether this is specific to matrix multiply: no, these techniques are general-purpose. Today we're just looking at matrix multiply, but in Thursday's lecture we'll look at many other problems and how to optimize them for the cache hierarchy. Other questions?

OK, so tiling gives good cache performance but isn't very portable. Let's see if we can come up with a simpler design that still gets pretty good cache performance, turning to divide and conquer. We'll look at the recursive matrix multiplication algorithm we saw before; again square matrices, though the results generalize to non-square ones. We split each input matrix into four quadrants, and each quadrant of the output matrix is the sum of two matrix multiplications of (n/2) x (n/2) matrices: C11 = A11 * B11 + A12 * B21, and so on. We do this recursively, so at every level of recursion we perform eight multiply-adds of (n/2) x (n/2) matrices. In the recursive code there are eight recursive calls; the base case here is size 1, though in practice you want to coarsen the base case to overcome function call overheads.

Let's also look at the offset values in the code; I've color-coded them to match the quadrants of the submatrix on the right. They correspond to the index of the first element of each quadrant: the first element of quadrant 11 has offset 0; the first element of quadrant 12 is on the same row, so I just need to add the width of the quadrant, n/2; the first element of quadrant 21 is n/2 rows down, each row of length row_size, so its offset is (n/2) * row_size; and the first element of quadrant 22 is that plus n/2, that is, (n/2) * row_size + n/2.
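Here's a sketch of the recursive code (with a hypothetical coarsened base case; row_size is the row length of the underlying matrices, so the quadrant offsets match the ones just described, and for simplicity it assumes all three matrices share that row size and n is a power of 2):

```c
#include <stdint.h>

#define THRESHOLD 32  // hypothetical coarsening; the slide's base case is 1

// C += A * B on n x n submatrices of matrices whose rows hold row_size elements.
void matmul_rec(double *restrict C, const double *restrict A,
                const double *restrict B, int64_t n, int64_t row_size) {
    if (n <= THRESHOLD) {  // coarsened base case: plain loops
        for (int64_t i = 0; i < n; i++)
            for (int64_t k = 0; k < n; k++)
                for (int64_t j = 0; j < n; j++)
                    C[i*row_size + j] += A[i*row_size + k] * B[k*row_size + j];
        return;
    }
    int64_t h = n / 2;
    // Quadrant offsets: 0, h, h*row_size, h*row_size + h, as described above.
    int64_t off[2][2] = {{0, h}, {h*row_size, h*row_size + h}};
    // Eight recursive multiply-adds: C_ij += A_i0 * B_0j + A_i1 * B_1j.
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                matmul_rec(C + off[i][j], A + off[i][k], B + off[k][j],
                           h, row_size);
}
```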
Let's analyze the work of this algorithm. What's the recurrence? How many subproblems do we have? Eight. What size? n/2. And how much work to set up the recursive calls? A constant amount. So the recurrence is W(n) = 8 W(n/2) + Theta(1), and it solves to Theta(n^3); it's one of the three cases of the master theorem. But let's analyze it in more detail by drawing the recursion tree, which gives some intuition for why the master theorem is true. At the top level we have a problem of size n, branching into 8 subproblems of size n/2, doing a constant amount of work to set up the recursive calls; I'll just label that 1, ignoring the constants, which doesn't matter for asymptotic analysis. Each of those branches again into 8 subproblems of size n/4, and so on down to the leaves. How many levels until we reach the leaves? log2(n), base 2 because we divide the problem size by two each time. The number of leaves is therefore 8^(log2 n), since we branch 8 ways at every level, and 8^(log2 n) = n^(log2 8) = n^3. The work per level is 1 at the top, then 8, then 64, and so on, geometrically increasing as we go down the tree, with Theta(n^3) at the leaves, since there are n^3 of them each doing constant work. So the overall work is dominated by the leaves: Theta(n^3), the same as the looping versions of matrix multiply. They're all cubic work.

Now let's analyze the number of cache misses of this divide-and-conquer algorithm. The recurrence is different: the base case is now when the submatrix fits in the cache, that is, when n^2 < cM for some sufficiently small constant c <= 1. At that point we just load the submatrix into cache, needing Theta(n^2/B) misses by the submatrix caching lemma, and incur no further misses. Otherwise we recurse into 8 subproblems of size n/2 and add Theta(1), the constant work to set up the calls. So the difference between this cache analysis and
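Side by side, the two recurrences are (with c <= 1 a sufficiently small constant):

```latex
W(n) = 8\,W(n/2) + \Theta(1) = \Theta(n^3)

Q(n) =
\begin{cases}
\Theta(n^2/B) & \text{if } n^2 < cM,\\
8\,Q(n/2) + \Theta(1) & \text{otherwise,}
\end{cases}
\qquad\text{which solves to}\qquad
Q(n) = \Theta\!\left(\frac{n^3}{B\sqrt{M}}\right).
```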
the work analysis before is the base case: in all the analyses you've seen before, the base case was of constant size, but here it isn't.

Let's analyze this using the recursion tree approach. At the top level we have a problem of size n, branching into 8 problems of size n/2, incurring a constant number of cache misses at each internal node; call it 1. We branch again, and eventually we reach the base case, where n^2 < cM; there the number of misses per leaf is Theta(cM/B), just plugging cM in for n^2. The number of levels in this recursion tree is no longer log2(n): it's log2(n) - log2(sqrt(cM)), which is log2(n) - (1/2) log2(cM), because we keep dividing the problem size by two only until it fits into cache, that is, until n < sqrt(cM), so we subtract that many levels. The number of leaves is 8 raised to the number of levels: 8^(log2(n) - (1/2) log2(cM)) = Theta(n^3 / M^(3/2)). The n^3 comes from the 8^(log2 n) term, and 8^(-(1/2) log2(cM)) contributes the M^(3/2) in the denominator. Any questions on how I computed the number of levels? Since we branch eight ways, the leaf count is just 8 to the power of the number of levels.

Now the cache misses per level: a constant at the top, say 1; then 8; then 64; and at the leaves, Theta(n^3 / (B sqrt(M))), obtained by multiplying the number of leaves by the misses per leaf: (n^3 / M^(3/2)) * Theta(cM/B). One factor of M cancels, leaving sqrt(M) in the denominator together with the block size B. Again this is a geometric series dominated by the leaves, so the total number of cache misses is Theta(n^3 / (B sqrt(M))).

Notice that this is the same number of cache misses as the tiled version of the code, but here there is no tuning parameter for any particular cache size anywhere in the code. So which cache sizes does this code work for; will it get good cache performance on your machine? It works for all cache sizes, because it was never tuned for any particular one. This is what's known as a cache-oblivious algorithm: it has no voodoo tuning parameters and no explicit knowledge of the caches, and it's essentially passively auto-tuning itself to the cache size of your machine.
It also works for multi-level caches automatically: I never specified which level of cache I was analyzing this for, so the analysis applies at any level and still gives good cache complexity at each one. And it's good in multiprogrammed environments, where other jobs may be running and you don't know your effective cache size; it just passively adapts to whatever cache is available. It turns out that the best cache-oblivious codes today work on arbitrary rectangular matrices, not just the square ones I've discussed, and they perform binary splitting instead of eight-way splitting: you split on the largest of i, j, and k. That's what the best cache-oblivious matrix multiplication algorithm does. Any questions?

OK, so far I've only talked about the serial setting; I was assuming these algorithms ran on a single thread. What happens on multiple processors? It turns out the results generalize to the parallel context. Here is the recursive parallel matrix multiply code we saw before: notice that we execute four of the recursive sub-calls in parallel, do a sync, and then do the remaining four sub-calls in parallel. The sync is needed because the second group of multiplies accumulates into the same output quadrants as the first.
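Here's a sketch of that parallel structure in Cilk (reusing the quadrant offsets from the serial sketch above; the base case is left uncoarsened for brevity):

```c
#include <cilk/cilk.h>
#include <stdint.h>

// Four independent quadrant multiplies run in parallel, then a sync,
// then the four multiplies that accumulate into the same C quadrants.
void matmul_par(double *C, double *A, double *B,
                int64_t n, int64_t row_size) {
    if (n == 1) { C[0] += A[0] * B[0]; return; }
    int64_t h = n / 2;
    int64_t off[2][2] = {{0, h}, {h*row_size, h*row_size + h}};
    cilk_spawn matmul_par(C + off[0][0], A + off[0][0], B + off[0][0], h, row_size);
    cilk_spawn matmul_par(C + off[0][1], A + off[0][0], B + off[0][1], h, row_size);
    cilk_spawn matmul_par(C + off[1][0], A + off[1][0], B + off[0][0], h, row_size);
               matmul_par(C + off[1][1], A + off[1][0], B + off[0][1], h, row_size);
    cilk_sync;  // the next group writes the same C quadrants, hence the sync
    cilk_spawn matmul_par(C + off[0][0], A + off[0][1], B + off[1][0], h, row_size);
    cilk_spawn matmul_par(C + off[0][1], A + off[0][1], B + off[1][1], h, row_size);
    cilk_spawn matmul_par(C + off[1][0], A + off[1][1], B + off[1][0], h, row_size);
               matmul_par(C + off[1][1], A + off[1][1], B + off[1][1], h, row_size);
    cilk_sync;
}
```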
To analyze the cache misses of this parallel code, we'll use the following theorem. Let Q_P be the number of cache misses of a deterministic Cilk computation when run on P processors, each with a private cache of size M, and let S_P be the number of successful steals during the computation. Then in the ideal cache model, Q_P = Q_1 + O(S_P * M/B): the number of cache misses in the parallel execution equals the number in the serial execution plus a term of order the number of steals times M/B.

The proof goes as follows. Recall that in the Cilk runtime system, workers steal tasks from other workers when they have no work to do. After a worker steals a task, its cache is completely cold in the worst case, because it wasn't working on that subproblem before; but after M/B cold misses, its cache becomes identical to what it would be at that point in the serial execution. So we pay M/B cache misses to make the cache look the same as if we were executing serially, and the same is true when a worker resumes a stolen computation after a cilk_sync. These two situations can happen at most 2 S_P times, twice the number of steals, and each time we pay at most M/B misses; that's where the additive O(S_P * M/B) term comes from. We also know that the number of steals in a Cilk program is at most O(P * T_infinity) in expectation, where P is the number of processors and T_infinity is the span of the computation. So if you can minimize the span of your computation, this also gives you good cache bounds. The moral of the story: minimizing the number of cache misses in the serial execution essentially minimizes them in the parallel execution, for a low-span algorithm.

For this recursive matrix multiplication algorithm, the span is T_infinity(n) = 2 T_infinity(n/2) + Theta(1); because we sync, we have to pay the critical path length of two sub-calls in sequence. This solves to Theta(n). Applying the theorem, the cache miss bound is Theta(n^3 / (B sqrt(M))), the same as the serial execution, plus an additive term of order P * n * M/B, that is, P times the span times M/B. So that was a parallel algorithm for matrix multiply, and we saw that we also get good cache bounds there.

Here's a summary of what we talked about today. We talked about associativity in caches and the different ways you can design a cache. We covered the ideal cache model, which is useful for analyzing algorithms. We looked at cache-aware algorithms that have explicit knowledge of the cache, with tiled matrix multiply as the example, and then came up with a much simpler cache-oblivious algorithm using divide and conquer. In Thursday's lecture we'll see much more on cache-oblivious algorithm design, and you'll have an opportunity to analyze the cache efficiency of some algorithms in the next homework.