The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

All right, so today we're going to talk about cache-oblivious algorithms. Who remembers what a cache-oblivious algorithm is? Yes: oblivious to cache size. A cache-oblivious algorithm is an algorithm that automatically tunes for the cache size on your machine: it achieves good cache efficiency, and the code doesn't need any knowledge of the cache parameters of your machine. In contrast, a cache-aware algorithm would actually know the cache parameters of your machine, and the code would have the sizes of the cache written into it. Last time we saw one cache-oblivious algorithm, for matrix multiplication, and today we're going to talk about some other ones.

The first example I want to talk about is simulation of heat diffusion. Here's a famous equation known as the heat equation. This equation is in two dimensions, and we want to simulate a function u that has three parameters: t, x, and y. t is the time step, x and y are the coordinates of the 2D space, and we want to know the temperature at each (x, y) coordinate for any point in time t. The 2D heat equation can be modeled using a differential equation. How many of you have seen differential equations before? Okay, good, most of you. Here I'm showing the equation in two dimensions, but you get similar equations for higher dimensions. It says that the partial derivative of u with respect to t is equal to alpha (alpha is what's called the thermal diffusivity, a constant) times the sum of the second partial derivative of u with respect to x and the
second partial derivative of u with respect to y. This is a pretty famous equation, and you can see that we have a single derivative on the left side and double derivatives on the right side. So how do we actually write code to simulate this diffusion process? Often in scientific computing, people come up with these differential equations to describe physical processes, and then they want efficient code to actually simulate the physical process.

Here's an example of 2D heat diffusion. Say we start out with the configuration on the left, where the color corresponds to temperature: a brighter color means hotter, yellow is the hottest, and blue is the coldest. On the left we just have "6.172", which is the course number; if you didn't know that, you're probably in the wrong class. Then we run it for a couple of time steps, and the heat diffuses to the neighboring regions of the 2D space, so we might get the configuration on the right, where the heat is more spread out. Often we want to run this simulation for a number of time steps until the distribution of heat converges, that is, it becomes stable and doesn't change by much anymore, and then we stop the simulation.

This is the 1D heat equation. I showed you the 2D one earlier, but we're actually going to write code for the 1D heat equation, since it's simpler, and all the ideas generalize to higher dimensions. Here's the range of colors corresponding to temperatures: the hottest color is on the left and the coldest on the right. If you had a heat source on the left-hand side of this bar, then this might be a stable distribution: if you keep running the simulation, you might get a stable distribution of heat that looks like this.

So how do we actually write code to simulate this differential equation? One commonly used method is known as
finite difference approximation. We're going to approximate the partial derivative of u with respect to each of its coordinates. The partial derivative of u with respect to t is approximately u(t + Δt, x), where Δt is some small value, minus u(t, x), all divided by Δt. How many of you have seen this approximation method before, from your calculus class? Okay, good. As you bring the value of Δt down to zero, the thing on the right-hand side approaches the true partial derivative.

That's the partial derivative with respect to t; we also need the partial derivative with respect to x. Here I'm saying that the partial derivative with respect to x is approximately u(t, x + Δx/2) minus u(t, x - Δx/2), all divided by Δx. Notice that instead of adding Δx in the first term and nothing in the second term, I'm adding Δx/2 in the first term and subtracting Δx/2 in the second term. It turns out the approximation is still valid as long as the difference between the two arguments is Δx; I get to decide how to split up this Δx between the two terms in the numerator. The reason I chose Δx/2 here is that the math is going to work out nicely and give us cleaner code.

That's just the first partial derivative with respect to x, and I actually need the second partial derivative, since the right-hand side of the heat equation has a second partial derivative. So I take the partial derivative with respect to x of each of the terms in the numerator of the equation above, and now I can plug in the value of that partial derivative by applying the equation above with the arguments t and x + Δx
over 2, and similarly for the second term. For the first term, when I plug it into the equation for the partial derivative with respect to x, I get u(t, x + Δx) minus u(t, x); for the second term, I get u(t, x - Δx) minus another factor of u(t, x). That's why I'm subtracting 2·u(t, x) in the numerator here. The partial derivative of each of the terms in the numerator also divides by a Δx, so in the denominator I get (Δx)². Now I have the second partial derivative with respect to x, and I also have the first partial derivative with respect to t, so I can just plug them into the heat equation: the left-hand side is the time derivative, and the right-hand side is alpha times the second space derivative. This is what the 1D heat equation reduces to using the finite difference approximation method. Any questions on this?

Okay, so how do we actually write code to simulate this equation? We're going to use what's called a stencil computation. I'm going to set Δx and Δt equal to 1 just for simplicity, but in general you can set them to whatever you want; you can make them smaller for a more fine-grained simulation. With Δx = Δt = 1, the denominators of these two terms become 1 and I don't need to worry about them. I'm going to represent my space using a 2D matrix, where the horizontal axis represents values of x and the vertical axis represents values of t. I want to fill in all of the entries with a black dot. The ones with an orange dot are my boundaries, which are fixed throughout the computation: I don't do any computation on those, they're given to me as input. They could be heat sources, if we're doing a heat simulation. Now I can actually write code to simulate this equation.
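Collecting the finite-difference derivation above in one place, in the notation used so far:

```latex
% finite-difference approximations of the two derivatives
\frac{\partial u}{\partial t}(t,x) \approx \frac{u(t+\Delta t,\,x)-u(t,\,x)}{\Delta t},
\qquad
\frac{\partial u}{\partial x}(t,x) \approx \frac{u(t,\,x+\Delta x/2)-u(t,\,x-\Delta x/2)}{\Delta x}

% applying the first-derivative approximation twice gives the second derivative
\frac{\partial^2 u}{\partial x^2}(t,x) \approx
  \frac{u(t,\,x+\Delta x)-2\,u(t,\,x)+u(t,\,x-\Delta x)}{(\Delta x)^2}

% so the 1D heat equation  \partial u/\partial t = \alpha\,\partial^2 u/\partial x^2  becomes
\frac{u(t+\Delta t,\,x)-u(t,\,x)}{\Delta t}
  = \alpha\,\frac{u(t,\,x+\Delta x)-2\,u(t,\,x)+u(t,\,x-\Delta x)}{(\Delta x)^2}
```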
To compute u(t+1, x), I go up to the equation, bring the negative u(t, x) term over to the right, and get u(t+1, x) = u(t, x) + α·(u(t, x+1) - 2·u(t, x) + u(t, x-1)). As I said before, we just keep iterating this until the temperatures become stable. I proceed forward in time, which is going up in this picture. To compute one of these points, say u(t+1, x), I need to know the value of u(t, x), which is the point directly below it in the matrix, and I also need u(t, x+1) and u(t, x-1), which are below it and diagonal to the left and the right. So each value depends on three other values; this is called a three-point stencil. That's the pattern this equation represents, and in general a stencil computation updates each point in an array using a fixed pattern, called a stencil.

I'm going to do the same thing for all of the other points: I compute all the values of x for a given time step, then move on to the next time step, and I keep doing this until the distribution of temperatures becomes stable, and then I'm done. These stencil computations are widely used in scientific computing: weather simulations, stock market simulations, fluid dynamics, image processing, probability, and so on. They're used all over the place in science, so this is a very important concept to know.

Now, say I run the code as I showed you in the animation, completing one row at a time before moving on to the next row. How would this code perform with respect to caching? Right: if n is small, this does pretty well, but what if n is much larger than the size of your cache? You would do badly. And why is that? Well, there could be some reuse here, because when I compute the second row I'm actually using some values that were
computed in the first row. But if the row is much larger than the cache size, then by the time I get to the second row, the values that were loaded into cache from the first row will have been evicted, and I'll suffer cache misses again for the values I need to load for the second row, even though they could have been reused if the code had good locality.

Another question: if we only care about the values of x at the most recent time step, do we actually have to keep around this whole 2D matrix, or can we get by with less storage? How many rows would I have to keep around if I only care about the most recent time step? Two. And why is that? Right: I need the row from the previous time step in order to compute the values in the current time step, and after the current time step I can just swap the roles of the two rows and reuse the previous row for the next step. When I'm computing the second row, I need to keep around all of the values computed in the first row, because those values get fed into the computations in the second row. Ah, I see what you're saying; that's actually a good observation: you only need a constant amount of extra storage beyond one row, because you can just overwrite the values as you go through the row. If you keep around one row, some of its values will be from the current time step and some from the previous time step. Good observation.

Okay, so that code as we saw it wasn't very cache-efficient. You could make it cache-efficient using tiling, but we're going to go straight to the cache-oblivious algorithm, because it's much cleaner. First, let's recall the ideal cache model that we talked about in the previous lecture. We have a two-level hierarchy: a cache, and then the main memory.
The cache has a size of M bytes, and a cache line is B bytes, so you can keep M/B cache lines in your cache. If something is in cache and you operate on it, it doesn't incur any cache misses, but if you have to go to main memory to load the cache line in, you incur one cache miss. The ideal cache model assumes that the cache is fully associative, so any cache line can go anywhere in the cache, and it also assumes either an optimal omniscient replacement policy or the LRU policy. The optimal omniscient replacement policy knows the sequence of all future requests to memory, and when it needs to evict something, it picks the thing whose eviction leads to the fewest cache misses overall. The LRU policy just evicts the thing that was least recently used. But we saw in the previous lecture that, in terms of asymptotic costs, these two replacement policies give you cache-miss counts within a constant factor of each other, so you can use either one, depending on what's convenient.

The two performance measures we care about when analyzing an algorithm in the ideal cache model are the work and the number of cache misses. The work is just the total number of operations the algorithm incurs; serially, this is just the ordinary running time. The number of cache misses is the number of cache lines you have to transfer between the cache and main memory.

Now, suppose we're analyzing an algorithm in the ideal cache model and it runs serially. What kinds of cache misses does the ideal cache model not capture? Remember, we talked about several types of cache misses, and there's one type this model doesn't capture when we're running serially. We're assuming no parallelism here, and sharing misses only come about when you have parallelism. Yes: the answer is conflict misses. And why does this model not capture them? Because this is a
fully associative cache: any cache line can go anywhere in the cache. You can only get conflict misses in set-associative schemes, where each cache line can only map to a particular set; if too many cache lines map to the same set, they keep evicting each other, even though the rest of the cache might have space. That's what's called a conflict miss. The ideal cache model does capture capacity misses, so it's still a very good model to use at a high level when you're designing efficient algorithms, because it encourages you to optimize for spatial and temporal locality. Once you have a good algorithm in the ideal cache model, you can start dealing with conflict misses using some of the strategies we talked about last time, such as padding or using temporary memory. Any questions on this?

Okay, here's the code that does the heat simulation we saw earlier. It's just a nested for loop: in the outer loop we loop over the time dimension, and in the inner loop we loop over the space dimension, so we compute all the values of x before moving on to the next time step. We store just two rows, using a trick called the even-odd trick, and here's how it works. The index of the next row we want to compute is just (t + 1) mod 2, and the index of the current row is t mod 2; this implicitly swaps the roles of the two rows as we progress through time. Then we set u[(t + 1) mod 2][x] equal to kernel(&u[t mod 2][x]), where the kernel function is defined up here. Recall that when we pass a pointer to this kernel function, we can treat the pointer as the beginning of an array; that's why we use array notation inside the kernel function. The array w is passed as input, and we return w[0], the element at the current pointer, plus alpha times (w[-1] - 2·w[0] + w[1]), where w[-1] is the element just before the one we're pointing to and w[1] is the element just after it.
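As a concrete reference, here is a minimal, self-contained sketch of the looping code just described. The array size N, step count T_STEPS, diffusivity ALPHA, and the boundary handling are illustrative choices, not taken from the slide; the kernel matches the one described above.

```c
/* Illustrative parameters (not from the lecture's slides). */
#define N 16
#define T_STEPS 100
#define ALPHA 0.1   /* must be <= 0.5 for stability when dx = dt = 1 */

/* The three-point stencil kernel: w points into the current row. */
static double kernel(const double *w) {
    return w[0] + ALPHA * (w[-1] - 2.0 * w[0] + w[1]);
}

/* Advance T_STEPS time steps, keeping only two rows (the even-odd
 * trick): row t % 2 is the current row, row (t + 1) % 2 is the next. */
void simulate(double u[2][N]) {
    for (int t = 0; t < T_STEPS; ++t) {
        for (int x = 1; x < N - 1; ++x)
            u[(t + 1) % 2][x] = kernel(&u[t % 2][x]);
        /* copy the fixed boundary values forward */
        u[(t + 1) % 2][0] = u[t % 2][0];
        u[(t + 1) % 2][N - 1] = u[t % 2][N - 1];
    }
}
```

The stability condition in the comment is the standard one for this explicit scheme; with Δx = Δt = 1 the update is a convex combination of neighbors only when α ≤ 1/2.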
Okay, let's look at the caching behavior of this code. We'll analyze the cache complexity, and we'll assume the LRU replacement policy, because we can. As we said before, the code loops through one entire row at a time before going on to the next row. Assuming n is greater than M, so that the row size is greater than the cache size, the number of cache misses is Θ(nT/B). How do I get this bound? How many cache misses do I incur per row of the 2D space? Right: Θ(n/B) cache misses per row, because I can load in B bytes at a time, so I benefit from spatial locality, and I have n elements to compute. And as we said before, by the time I get to the next row, the values I need from the previous row have already been evicted from cache, so I incur Θ(n/B) cache misses for every row, and the number of rows I compute is T. That gives Θ(nT/B) in total. Any questions on this analysis?

How many of you think we can do better than this? Okay, one person, two, three. It turns out that we can do better. You could do better with tiling, but I'm not going to do the tiling version; I want to do the cache-oblivious version. The cache-oblivious version works on trapezoidal regions of the 2D space. Recall that a trapezoid has a top base and a bottom base; here the top base is at t1 and the bottom base is at t0, so the height is t1 - t0, and the width of the trapezoid is its width at the midpoint between t1 and t0, that is, at (t1 + t0)/2. We're going to compute all of the points inside this
trapezoid that satisfy these inequalities: t has to be greater than or equal to t0 and less than t1, and x has to be greater than or equal to x0 + dx0·(t - t0), where dx0 is the inverse slope on the left side, and less than x1 + dx1·(t - t0), where dx1 is the inverse slope on the other side. dx0 and dx1 each have to be -1, 0, or 1: a value of -1 corresponds to an inverse slope of -1, which is also a slope of -1; a value of 1 is a slope, or inverse slope, of 1; and a value of 0 is just a vertical line.

The nice property of this trapezoid is that we can compute everything inside it without looking outside it, so we can compute everything here independently of any other trapezoids we might generate. We're going to come up with a divide-and-conquer approach to execute this code. The divide-and-conquer algorithm has a base case: when the height of the trapezoid is 1, we just compute all of the values using a simple loop. Any order of computation inside this loop is valid, since we already have all the values in the base of the trapezoid, and the values in the top of the trapezoid don't depend on each other. So that's the base case. Any questions so far?

Now for the recursive cases; it turns out we'll have two different kinds of cuts. The first is called a space cut. I do a space cut if the width of the trapezoid is greater than or equal to twice the height. That means the trapezoid is too wide, and I cut it vertically; more specifically, I cut it with a line of slope -1 going through the center of the trapezoid. Then I traverse the trapezoid on the left side first, and after I'm done with that, I traverse the trapezoid on the right side.
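In symbols, the region and the cut rule are:

```latex
% the trapezoid: all points (t, x) with
t_0 \le t < t_1,
\qquad
x_0 + dx_0\,(t - t_0) \;\le\; x \;<\; x_1 + dx_1\,(t - t_0),
\qquad dx_0, dx_1 \in \{-1, 0, 1\}

% height, and width measured at the midpoint in time
h = t_1 - t_0,
\qquad
w = (x_1 - x_0) + (dx_1 - dx_0)\,\frac{t_1 - t_0}{2}

% choose the cut: space cut when too wide, time cut when too tall
w \ge 2h \;\Rightarrow\; \text{space cut (line of slope } -1\text{)},
\qquad
w < 2h \;\Rightarrow\; \text{time cut (horizontal line)}
```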
Can I actually switch the order, and compute the stuff on the right side before the stuff on the left side? No. Why is that? Right: some points in the right trapezoid depend on values from the left trapezoid. For the left trapezoid, every point we want to compute already has everything it needs, given the values on its base; but for the right trapezoid, some of the values depend on values in the left trapezoid, so we can't execute the right trapezoid until we're done with the left one. This is also the reason I cut the trapezoid with a slope of -1 instead of using a vertical cut: if I did a vertical cut, then each of the two trapezoids would contain points that depend on the other trapezoid. So that's one of the two cuts, the space cut, and it happens when the trapezoid is too wide.

The other cut is the time cut, where I cut with respect to the time dimension. This happens when the trapezoid is too tall: when the width is less than twice the height, I cut it with a horizontal line through the center, and I traverse the bottom trapezoid first; after I'm done with that, I traverse the top trapezoid. Again, the top trapezoid depends on some points from the bottom trapezoid, so I can't switch the order. Any questions?

Okay, let's now look at the code that implements this recursive divide-and-conquer algorithm. Here's the C code. It takes as input t0 and t1, the coordinates of the bottom and the top of the trapezoid; x0, the left side of the bottom base; dx0, the inverse slope on the left; x1, the right side of the bottom base; and dx1, the inverse slope on the right. We first compute the height of the trapezoid. We let lt be the height, and that's
just t1 - t0. If the height is 1, then we just use a for loop over all the points in that height-1 trapezoid, calling the kernel function we defined before. Otherwise the height is greater than 1, and we check whether we should do a space cut or a time cut. We do a space cut if the trapezoid is too wide, and the condition inside the if clause checks exactly that. If so, we make two recursive calls to trapezoid, cutting in the middle with a slope of -1: you can see the -1's in the recursive calls. Otherwise we do a time cut: we cut at the value of t equal to t0 + lt/2, and again we have two recursive calls to trapezoid.

Even though I only ever generate slopes of -1 in the recursive calls, this code works even if I start with slopes of 1 or 0. For example, if I start with a rectangular region, the slopes are just 0, and the code still works. But most of the slopes I deal with will be -1, because those are the only new slopes I generate. Any questions?

So this code is very concise, and it turns out that the even-odd trick still works here: you can still keep around just two rows, because you're guaranteed not to overwrite any value until everything that depends on it has been computed. But when you're using just two rows, the values in a particular row might not all correspond to the same time step, because we're not finishing an entire row before moving on to the next one; we're partially computing rows.
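Here is a self-contained sketch of the trapezoid routine along the lines just described. It follows the structure of the code on the slide, but the array size N and the diffusivity ALPHA are illustrative choices, and u and kernel are set up as in the looping version.

```c
/* Illustrative parameters (not from the lecture's slides). */
#define N 32
#define ALPHA 0.1

static double u[2][N];   /* two rows, indexed by time mod 2 */

static double kernel(const double *w) {
    return w[0] + ALPHA * (w[-1] - 2.0 * w[0] + w[1]);
}

/* Compute all points (t, x) with t0 <= t < t1 and
 * x0 + dx0*(t - t0) <= x < x1 + dx1*(t - t0). */
void trapezoid(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt == 1) {
        /* base case: a height-1 trapezoid; any order over x is valid */
        for (int x = x0; x < x1; ++x)
            u[(t0 + 1) % 2][x] = kernel(&u[t0 % 2][x]);
    } else if (lt > 1) {
        /* space cut if width >= 2 * height (scaled by 2 to stay integral) */
        if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
            /* xm: where the slope -1 cut through the center meets the bottom */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
            trapezoid(t0, t1, x0, dx0, xm, -1);   /* left trapezoid first */
            trapezoid(t0, t1, xm, -1, x1, dx1);   /* then the right one */
        } else {
            /* time cut: bottom half first, then the top half */
            int halflt = lt / 2;
            trapezoid(t0, t0 + halflt, x0, dx0, x1, dx1);
            trapezoid(t0 + halflt, t1, x0 + dx0 * halflt, dx0,
                      x1 + dx1 * halflt, dx1);
        }
    }
}
```

Calling trapezoid(0, T, 1, 0, N - 1, 0) on a rectangular region of interior points produces the same values as the looping version, as long as the fixed boundary values are present in both rows.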
Now let's analyze the cache complexity of this algorithm, again using the recursion-tree approach that we talked about last time. The code splits itself into two subproblems at every level until it gets to a leaf, and each leaf represents Θ(hw) points, where h is the height, w is the width, and h = Θ(w). We have h = Θ(w) by the nature of the algorithm: when a trapezoid becomes too wide, we make it less wide with a space cut, and when it becomes too tall, we do a horizontal cut, so the height and the width are within a constant factor of each other when we reach the base case.

Each leaf incurs Θ(w/B) cache misses, because we have to load the base of the trapezoid in from main memory; once that's in cache, we can compute all of the other points in the trapezoid without incurring any more cache misses. In the analysis, we'll set w = Θ(M), because that's the point at which the trapezoid fits into cache. The algorithm itself has no knowledge of the parameter M, so it will keep dividing and conquering down to base cases of size 1, but just for the analysis we use a base case where w = Θ(M).

The number of leaves is Θ(nT/(hw)), since each leaf covers Θ(hw) points. The number of internal nodes is the number of leaves minus one, since we have a tree, but the internal nodes don't contribute substantially to the cache complexity: each one incurs only a constant number of cache misses to set up its two recursive calls, and nothing more expensive than that. So we just need to count the cache misses at the leaves: Θ(nT/(hw)) leaves, each incurring Θ(w/B) cache misses. Multiplying these out, the w terms cancel, leaving Θ(nT/(hB)); setting h and w to Θ(M), we're left with Θ(nT/(MB)) as our cache complexity.
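The leaf count and per-leaf cost multiply out as follows:

```latex
% number of leaves, each covering \Theta(hw) points of the nT space
\#\text{leaves} = \Theta\!\left(\frac{nT}{hw}\right),
\qquad
\text{misses per leaf} = \Theta\!\left(\frac{w}{B}\right)

% total cache misses, with h = w = \Theta(M) chosen so a leaf fits in cache
Q = \Theta\!\left(\frac{nT}{hw}\right)\cdot\Theta\!\left(\frac{w}{B}\right)
  = \Theta\!\left(\frac{nT}{hB}\right)
  = \Theta\!\left(\frac{nT}{MB}\right)
```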
Question? Sure: each leaf incurs only Θ(w/B) cache misses because we need to load the values of the base of the trapezoid, and that incurs Θ(w/B) misses since the base is Θ(w) wide. Everything else fits into cache, so when we compute the remaining points, the previous values we need are already in cache, and we incur no further misses. Does that make sense? Okay.

This was the analysis for one dimension, but it generalizes to more than one dimension. In general, with d dimensions, the cache complexity is Θ(nT / (M^(1/d) · B)). If d is 1, that reduces to nT/(MB); if d is 2, it's nT/(B·√M); and so on. It turns out this bound is also optimal. Any questions? So, compared to the looping code, this code has much better cache complexity: it saves a factor of M. The looping code had cache complexity Θ(nT/B); it didn't have an M in the denominator.

Okay, we're now going to do a simulation comparing how the looping code and the trapezoidal code run. In this simulation, the green points correspond to cache hits and the purple points to cache misses. We're simulating a fully associative cache with the LRU replacement policy, where a cache line holds four points and the cache holds 32 points, and we set the cache hit latency to one cycle and the cache miss latency to ten cycles, so an order of magnitude slower for misses. We're doing this on a rectangular region where n is 95, for 87 time steps. So let me pull up the simulation. On the left-hand side we have the looping code, and on the right-hand side the trapezoidal code. Let's start. You can see that the looping code is going over one row at a
time, whereas the trapezoidal code is not doing that: it partially computes one row and then moves on to the next. I can also show the cuts generated by the trapezoidal algorithm; let me remember how to do this... there, those are the cuts generated by the trapezoidal algorithm. I can speed this up, and you can see that the trapezoidal algorithm is incurring far fewer cache misses than the looping code. I'll make this go faster, and the trapezoidal code finishes while the looping code is still slowly making its way up to the top. Okay, it's finally done. Any questions on this?

In which regions? So you get a cache miss for the purple dots here: the cache line size is four, so you get a cache miss for the first point and then hits on the next three points, then another cache miss for the fifth point and hits on the sixth, seventh, and eighth points. For the row above it, we're assuming the values it needs fit in cache already, so we don't need to load them from memory: the points above just depend on values we've already computed, and those are already in cache. The base of the trapezoid is already in cache, and for the row right above it we just look at those values. Does that make sense? Any other questions?

Do you want to see this again? I could let this run for the rest of the lecture, but I have more interesting material I want to talk about, so let's just stop after this finishes. As you can see, the looping code is slowly making its way to the top; it's much slower than the trapezoidal code.

Okay, that was for one dimension; now let's look at what happens in two dimensions. Here I'm going to show another demo, and in this demo I'm actually going to run the code for real. The previous demo was just a simulation, but this is going to happen in real time, and I'm going to
simulate the heat in a 2D space. Recall that the colors correspond to temperatures, so a brighter color means hotter. Let me pull up the other demo. Okay, here my mouse cursor is the source of heat: you can see it making the points around the cursor hot, and then the heat slowly diffuses its way to the other points. I can move the cursor so that the heat source changes, and then the heat I generated earlier slowly goes away.

In the lower left-hand corner I'm showing the number of iterations per second of the code, and we can see that the looping code is doing about 1,560 iterations per second. Let's see what happens when we switch to the trapezoidal code. The trapezoidal code is doing about 1,830 iterations per second, so it's a little bit faster, but not by too much. Does anyone have an idea why we're seeing this behavior? We said that the trapezoidal code incurs many fewer cache misses than the looping code, so we would expect it to be significantly faster, but here it's only a little bit faster.

Right, that's a good point: in 2D you're only saving a factor of √M instead of M. But √M is still pretty big compared to the speedup we're getting here. Any other ideas? Yes, there is a constant factor in the trapezoidal code, but even after accounting for the constant factor, you should still see a better speedup than this. So, even accounting for constant factors, what other problem might be going on here? That's another good observation, but the caches we're using should still behave pretty close to the ideal cache model; you might be off by a small constant factor, but not by too much. Sorry, could you repeat that? Yes, if I move the cursor, it's probably going to get
slower, but only by a little bit, and that doesn't really affect the performance; I can just leave my cursor there, and this is what the iterations per second is. Yes, there is some other factor dominating. Does anyone have an idea what that factor might be? No, it's not the rendering: I ran the code without showing the graphics, and the performance was similar. Yes, there could be other things using the cache, but that would be true for both of the programs, and I don't have anything intensive running; I don't think PowerPoint uses much of my cache.

All right, let's look at why this is the case. It turns out the hardware is actually helping the looping code. The question is: how come the cache-oblivious trapezoidal code incurs so many fewer cache misses, but the advantage it gains over the looping version is so marginal? The answer is that the hardware is helping the looping code by doing hardware prefetching, and hardware prefetching works pretty well for the looping code because its access pattern is very regular: it goes one row at a time. The hardware can predict the memory access pattern of the looping code, and therefore it can bring in the cache lines the code will need before it actually gets to that part of the computation. So prefetching is helping the looping code, and it's not helping the trapezoidal code nearly as much, because the trapezoidal code's access pattern is less regular. Prefetching does use memory bandwidth, but when you're using just a single core, you have more than enough bandwidth to take advantage of the hardware prefetching capabilities of the machine. Later on, we'll see that when we run in parallel, memory bandwidth does become more of an issue, when multiple processors are all using the memory.

Question? Yes, you can do software prefetching; there are instructions to do software prefetching.
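As a toy illustration of what software prefetching looks like (this example is not from the lecture): __builtin_prefetch is a GCC/Clang builtin that hints that a cache line should be fetched early. The lookahead distance of 64 elements here is an arbitrary illustrative choice that would need tuning for a real machine.

```c
#include <stddef.h>

/* Sum an array while prefetching a cache line well ahead of the
 * current position. The prefetch is only a hint: correctness does
 * not depend on it, only (potentially) performance. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 3);  /* read, high locality */
        s += a[i];
    }
    return s;
}
```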
prefetching. Hardware prefetching is usually more efficient, but it's less flexible than software prefetching. But here we're not actually doing that — we're just taking advantage of hardware prefetching. Yeah — so we didn't actually try that; the trapezoidal code could benefit a little bit if we used some software prefetching, although I think the looping code would probably benefit as well if we did that. Yes — sorry? Uh, yeah, so hardware prefetching is enabled; it's always done by the hardware on the machines that we're using today. So this was a pretty surprising result, but we'll actually see the effect of the memory bandwidth later on when we look at the parallel code. Any other questions before I continue? Okay, so let's now look at the interplay between caching and parallelism. This was a theorem that we proved in the previous lecture, so let's just recall what it says. Let Q_P be the number of cache misses in a deterministic Cilk computation when run on P processors, each with a private cache, and let S_P be the number of successful steals during the computation. In the ideal cache model with a cache size of M and a block size of B, the number of cache misses on P processors is equal to the number of cache misses on one processor plus order the number of successful steals times M over B — that is, Q_P = Q_1 + O(S_P · M/B). And last time we also said that the number of successful steals is upper bounded by the span of the computation times the number of processors, so if you minimize the span of your computation, then you can also minimize the number of successful steals. And then for low-span algorithms, the first term, the Q_1 term, is usually going to dominate. I'm not going to go over the proof — we did that last time. The moral of the story is that minimizing cache misses in the serial execution essentially minimizes them in the parallel execution, assuming that you have a low-span algorithm. So let's see whether our trapezoidal algorithm works in parallel. Does the space cut work in parallel?
Recall that for the space cut, I'm cutting with a slope of negative 1 through the center, because the trapezoid is too wide. So can I execute the two trapezoids in parallel here? No — the reason is that the right trapezoid depends on the result of the left trapezoid, so I can't execute them at the same time. But there is a way that I can execute trapezoids in parallel: instead of just doing the cut through the center, I'm actually going to do a V cut. So now I have three trapezoids. The two trapezoids in black — I can actually execute those in parallel, because they're independent: everything in those two trapezoids just depends on the base of that trapezoid. And after I'm done with the two trapezoids labeled 1, then I can compute the trapezoid labeled 2. This is known as a parallel space cut: it produces two black trapezoids as well as a gray trapezoid; the two black trapezoids execute in parallel, and afterwards the gray trapezoid executes. And this is done recursively as well. Any questions? No? Okay. We also had the time cut — oh, sorry, so if the trapezoid is inverted, then we're going to do this upside-down V cut, and in this case we're going to execute the middle trapezoid before we execute the two trapezoids on the sides. For the time cut, it turns out that we're just going to use the same cut as before, and we're just going to execute the two trapezoids serially. So we do get a little bit of parallelism here, from the parallel space cut. Let's look at how the parallel codes perform now. Okay, so this was the serial looping code; here's the parallel looping code. We had about 1,450 before, about 3,700 now — a little over twice the speed-up, and this is on a four-core machine; it's just my laptop. Sorry? Oh — yeah, sure. Yeah, it's slowing down a little bit, but not by too much. Okay, let's look at the trapezoidal code. As we saw before, the trapezoidal code does about 1,840 iterations per second, and we can parallelize this — so now it's doing about 5,350 iterations per second, so it's getting about a factor of three speed-up.
Okay, I can move it around a little bit more if you want to see — this is the serial trapezoid and the parallel trapezoid. Is everyone happy? Okay. Because I had to do this in real time, the input size wasn't actually that big, so I ran the experiment offline, without the rendering, on a much larger input. This input is a 3,000-by-3,000 grid, and I did this for 1,000 time steps using four processor cores, and my cache size — the last-level cache size — is 8 megabytes, so the input size here is much larger than my last-level cache size. And here are the times that I got. The serial looping code took about 129 seconds, and when we did it in parallel it was about 66 seconds, so we got about a factor of two speed-up, which is consistent with what we saw in the demo. The trapezoidal code actually got a better speed-up when we increased the input size — I got about a factor of four speed-up — and this is because for a larger input size, cache efficiency plays a much larger role, since the cache is so small compared to our input size. So here we see that the parallel looping code achieves less than half of its potential speed-up, even though the parallel looping code has much more potential parallelism than the trapezoidal code — the trapezoidal code only had a little bit of parallelism, just from the space cuts — whereas the trapezoidal code is actually getting pretty good speed-up. This is near-linear speed-up: I'm using four cores, and it's getting a 3.96x speed-up. So what could be going on here? Another thing to look at is to compare the serial trapezoidal code with the serial looping code, as well as the parallel trapezoidal code with the parallel looping code. If you look at the serial trapezoidal code, you see that it's about twice as fast as the serial looping code, but the parallel trapezoidal code is about four times faster than the parallel looping code. And the reason here is that hardware prefetching
can't help the parallel looping code that much: when you're running in parallel, all of the cores are using memory, and there's a memory bandwidth bottleneck, and prefetching actually needs to use memory bandwidth. So in the serial case we had plenty of memory bandwidth we could use for prefetching, but in the parallel case we don't have much memory bandwidth to spare. That's why, in the parallel case, the trapezoidal code gets a better speed-up over the parallel looping code compared to the serial case. The trapezoidal code also gets a better speed-up because it does things more locally, so it needs fewer memory operations — there's a scalability bottleneck at the memory interconnect, but because the trapezoidal code is cache-oblivious, it does a lot of work in cache, whereas the looping code does more accesses to main memory. Any questions on this? Okay, so when we have a parallel speed-up bottleneck, how do we know what's causing it? There are several main things we should look at: whether our algorithm has insufficient parallelism, whether the scheduling overhead is dominating, whether we have a lack of memory bandwidth, or whether there's contention going on — and contention can refer to either locking or true and false sharing, which we talked about in the last lecture. The first two are usually quite easy to diagnose: you can compute the work and the span of your algorithm, and from that you can get the parallelism. You can also use Cilkscale to help you diagnose the first two problems, because Cilkscale can tell you how much parallelism you're getting in your code, and it can also tell you the burdened parallelism, which includes the scheduler overhead. What about memory bandwidth — how can we diagnose that? Does anyone have any ideas? I can tell you one way to do it: I can open up my hardware and take out all of my memory chips except
for one of them, then run my serial code, and if it slows down, that means it was probably memory-bandwidth bound when I ran it in parallel. But that's a pretty heavy-handed way to diagnose this problem. Is there anything we can do with just software? Yes — yeah, so you could use a tool like Valgrind to count the number of memory accesses. Yes — so you can make sure only one processor is being used, but it might be using more memory than just the memory from one chip. There's actually a simpler way to do this: the idea is that we'll just run P identical copies of the serial code, so they'll all be executing in parallel, and if the execution slows down, that means they were probably contending for memory bandwidth. Does that make sense? One caveat is that you can only do this if you have enough physical memory, because when you're running P identical copies, you have to use more DRAM than if you just ran one copy. But oftentimes you can isolate some part of the code that you think has a performance bottleneck and just execute that part of the program with P copies in parallel, and hopefully that will take less memory. There are also hardware counters you can check, if you have root access to your machine, that can measure how much memory bandwidth your program is using — but this is a pretty simple way to do it. So there are ways to diagnose a lack of memory bandwidth. It turns out that contention is much harder to diagnose. There are tools that exist that detect lock contention in an execution, but usually they only detect contention when the contention actually happens — and the contention doesn't have to happen every time you run your code — so these tools don't detect the potential for lock contention. And the potential for true and false sharing is even harder to detect, especially false sharing, because if you're using a bunch of variables in your code, you don't know which of those
map to the same cache line — so this is much harder to detect. Usually when you're trying to debug a speed-up bottleneck in your code, you would first look at the first three things here, and once you've eliminated those, you can start looking at whether contention is causing the problem. Any questions? Okay. So we've talked about stencil computation; I now want to talk about another problem, sorting, and we want to do it cache-efficiently. So let's first analyze the cache complexity of a standard merge sort. We first need to analyze the complexity of merging, because it's used as a subroutine in merge sort. As you recall, in merging we're given two sorted input arrays, and we want to generate an output array that's also sorted, containing all the elements from the two input arrays. The algorithm maintains a pointer to the head of each of the input arrays; it compares the two elements, takes the smaller one, puts it into the output, and increments the pointer for that array. We keep doing this — taking the smaller of the two elements — until the two input arrays become empty, at which point we're done: we have one sorted output array. So to merge n elements, the time to do this is just Θ(n) — here n is the sum of the sizes of my two input arrays — because I'm only doing a constant amount of work for each of my input elements. What about the number of cache misses — how many cache misses will I incur when merging n elements? Yes — yeah, so I'm going to incur Θ(n/B) cache misses, because my two input arrays are stored contiguously in memory, so I can read them B bytes at a time with just one cache miss, and my output array is also stored contiguously, so I can write it out B bytes at a time with just one cache miss. I might waste a cache line at the beginning and end of each of my three arrays, but that only affects the bound by a constant factor, so the cache complexity is Θ(n/B).
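The two-pointer merge just described can be sketched as follows (the lecture's actual code would be in C; this Python version is just illustrative):

```python
def merge(a, b):
    """Merge two sorted lists into one sorted output.

    Work is Theta(n) for n = len(a) + len(b), and since all three
    arrays are scanned sequentially, the merge incurs Theta(n/B)
    cache misses for cache-line size B.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:        # take the smaller of the two heads
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])           # one input is exhausted; copy the rest
    out.extend(b[j:])
    return out
```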
So now let's look at merge sort. Recall that merge sort has two recursive calls to itself on inputs of half the size, and then it does a merge at the end to combine the two sorted outputs of its recursive calls. If you look at how the recursion proceeds, it first divides the input array into two halves, then divides again and again until it gets to the base case of just one element, at which point it returns. Then we start merging pairs of these elements together: now I have arrays of size two in sorted order; I merge pairs of those arrays and get subarrays of size four; and finally I do this one more time to get my sorted output. Okay, so let's review the work of merge sort. What's the recurrence for merge sort if we're computing the work? Yes — yeah, that's correct: I have two subproblems of size n/2, and then I need to do Θ(n) work to do the merge, so W(n) = 2 W(n/2) + Θ(n). This is case two of the master theorem: I compute log base b of a, which is log base 2 of 2, and that's 1, and that's the same as the exponent in the term that I'm adding in; since they're the same, I add in an additional log factor, and my overall work is just Θ(n log n). Okay, so now I'm going to solve this recurrence again using the recursion tree method. I'm still going to get Θ(n log n), but I'm doing this because it's going to be useful when we analyze the cache complexity. At the top level I have a problem of size n, and I branch into two problems of size n/2; when I'm done with them, I have to do a merge, which takes Θ(n) work — I'm just putting n here, ignoring the constants. I branch again, and each of those is going to do n/2 work to merge, and I get all the way down to my base case after log base 2 of n levels. The top level is doing n work; the second level is also doing n work; and it's going to be the same for all levels down to the leaves — the leaves are also doing a linear amount of work in total.
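The recursion just walked through, sketched in Python (again, illustrative only):

```python
def merge_sort(a):
    """Binary merge sort: W(n) = 2 W(n/2) + Theta(n) = Theta(n log n)."""
    if len(a) <= 1:              # base case: one element is already sorted
        return list(a)
    mid = len(a) // 2
    left = merge_sort(a[:mid])   # sort the two halves recursively
    right = merge_sort(a[mid:])
    out, i, j = [], 0, 0         # two-pointer merge of the sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```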
So the overall work is just the work per level times the number of levels, which is Θ(n log n). Any questions? Okay, so now let's analyze this with caching. We said earlier that the cache complexity of the merging subroutine is Θ(n/B), and here's the recurrence for the cache complexity of merge sort. My base case is when n is less than or equal to cM for some sufficiently small constant c, because at this point my problem fits into cache, and everything else I do for that problem is still going to be in cache; to load the base case into cache, I need to incur Θ(n/B) cache misses. Otherwise, I have two recursive calls of size n/2, and then I incur Θ(n/B) cache misses to do the merge of my two results. So here my base case is larger than what I did for the work — the algorithm actually still recurses down to a constant-size base case, but just for the analysis I'm stopping the recursion when n is less than or equal to cM. So let's analyze this. Again, I have a problem of size n at the beginning, then I split it into two problems of size n/2, and I have to pay n/B cache misses to merge the results together. Similarly for the next level: now I'm paying n/(2B) cache misses for each of my two subproblems to do the merge. I keep going down until I get to a subproblem of size Θ(cM); at that point it fits in cache, and I don't need to recurse anymore in my analysis. So the number of levels of this recursion tree is just log base 2 of n/(cM) — I'm basically chopping off the bottom of the recursion tree; the number of levels I had below this is log base 2 of cM, so I'm taking log base 2 of n and subtracting log base 2 of cM, which is equivalent to log base 2 of n divided by cM. The number of leaves I have is n/(cM), since each leaf is Θ(cM) large.
The number of cache misses I need to incur to process a leaf is just Θ(M/B), because I just need to load the input for that subproblem into cache, and then everything else fits in cache. So at the top level I'm incurring n/B cache misses; at the next level I'm also incurring n/B cache misses; same with the third level; and for the leaves I'm incurring Θ(M/B) times n/(cM) cache misses — n/(cM) is the number of leaves, and Θ(M/B) is the number of cache misses per leaf — and that also equals Θ(n/B). So overall, I multiply Θ(n/B) by the number of levels, which is log base 2 of n/(cM) — and I can just get rid of the constant, since it doesn't affect the asymptotic bound. So the number of cache misses is Θ(n/B) times log base 2 — or any base for the log — of n/M. Any questions on this analysis? So I am saving a factor of B here in the first term, which gives me a better cache complexity than just the work bound; but the M term is inside the log, and that doesn't actually help me that much. So let's look at how much we actually save. When n is much greater than M, log base 2 of n/M is approximately equal to log base 2 of n, so the M term doesn't contribute much to the asymptotic cost, and therefore, compared to the work, we're only saving a factor of B. When n is approximately equal to M, log base 2 of n/M is constant, and we're saving a factor of B log n. So we save more when the problem size is small, but for large enough problem sizes, we're only saving a factor of B. Does anyone think we can do better than this? I've asked this question several times before, and the answer is always the same — yes, it's a good answer. We're going to do this using multiway merging: instead of just merging two sorted subarrays, we're going to merge together R sorted subarrays.
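Before moving on to multiway merging, the binary merge sort cache analysis just completed can be summarized in one place (c is the sufficiently small constant from the base case):

```latex
Q(n) =
\begin{cases}
\Theta(n/B) & \text{if } n \le cM,\\[4pt]
2\,Q(n/2) + \Theta(n/B) & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
Q(n) = \Theta\!\left(\frac{n}{B}\,\log_2\frac{n}{cM}\right)
     = \Theta\!\left(\frac{n}{B}\,\log\frac{n}{M}\right).
```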
So we're going to have R subarrays, each of size n/R, and these are sorted, and I'm going to merge them together using what's called a tournament tree. Here's how the tournament tree works: I compare the heads of each pair of these subarrays and store the winner in a node of the tournament tree; after I do that, I compare those two nodes and get the minimum of those two; and eventually, after all the comparisons, I'm left with the minimum element at the root of the tournament tree, and I can place that into my output. The first time I fill this tournament tree, it's going to take Θ(R) work, because there are R nodes in the tournament tree. So when I compare these two elements, the smaller one is 6; for these two, the smaller one is 2; then I compare 2 and 6 and take the smaller one; and on the other side I have a 7, so I compare 2 and 7 — 2 is smaller, so it appears at the root, and I know that's going to be the minimum element among the heads of all of the R subarrays that I'm merging. So the first time, generating this tournament tree takes Θ(R) work, because I have to fill in R nodes; but once I've generated the tournament tree, for all subsequent rounds I only need to fill in the path from the element that won to the root of the tournament tree, because those are the only values that could have changed. So now I fill in this path here, and this only takes Θ(log R) work, because the height of the tournament tree is log base 2 of R. So I fill this in: 14 goes here, 6 is the smaller of the two, and then 6 is the smaller again — so my next element is 6, and I keep doing this until all of the elements from my R subarrays have been put into the output. The total work for merging is Θ(R) for the first round plus n log R for all the remaining rounds, and that's just equal to Θ(n log R), because we're assuming that n is bigger than R here.
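A sketch of the R-way merge in Python — here a binary heap over the run heads stands in for the tournament tree; both support find-min plus replace in Θ(log R), so merging n total elements still costs Θ(n log R):

```python
import heapq

def multiway_merge(runs):
    """Merge R sorted lists ('runs') into one sorted output list."""
    out = []
    # One entry per nonempty run: (head value, run index, index within run).
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)           # Theta(R), like filling the tournament tree
    while heap:
        val, r, i = heapq.heappop(heap)
        out.append(val)           # emit the overall minimum
        if i + 1 < len(runs[r]):  # replace it with its successor: Theta(log R)
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out
```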
Any questions on how the multiway merge works? No? Okay, so let's analyze the work of this multiway merge when it's used inside merge sort. The recurrence is as follows: if n is 1, we just do a constant amount of work; otherwise, we have R subproblems of size n/R, and then we pay Θ(n log R) to do the multiway merge. Here's the recursion tree: at the top level we have a problem of size n, we split into R subproblems of size n/R, and then we have to pay n log R work to merge the results of the recursive calls together, and we keep doing this. It turns out that the work at each level sums up to n log base 2 of R, and the number of levels is log base R of n, because each time we're branching by a factor of R. For the leaves, we have n leaves, and we just pay linear work in total for those, because we don't have to pay for a merge — we're not doing any more recursive calls. So the overall work is Θ(n log R) times log base R of n, plus n for the leaves, but that's just a lower-order term; and if you work out the math, some of the terms cancel out, and you just get Θ(n log n) for the work. So the work is the same as for the binary merge sort. Let's now analyze the cache complexity. Let's assume that R is less than cM/B for a sufficiently small constant c less than or equal to 1, and consider the R-way merging of contiguous arrays of total size n. If R is less than cM/B, then we can fit the entire tournament tree into cache, and we can also fit one block from each of the R subarrays into cache; in that case, the total number of cache misses for the multiway merge is just Θ(n/B), because we just have to scan over the n elements in our input arrays. So the recurrence for the R-way merge sort is as follows: if n is less than cM, then it fits in cache, and the
number of cache misses we pay is Θ(n/B); otherwise, we have R subproblems of size n/R, and we add Θ(n/B) to do the merge of the results of the subproblems. Any questions on the recurrence here? Yes — good question. So we haven't picked the value of R; this is not a cache-oblivious algorithm, and we'll see what to choose for R in a couple of slides. So let's analyze the cache complexity of this algorithm, again using the recursion tree. At the top level, we have R subproblems of size n/R, and we have to pay n/B cache misses to merge them, and it turns out that at each level, the number of cache misses we pay is n/B, if you work out the math. The number of levels of this recursion tree is log base R of n/(cM), because we stop recursing when our problem size is cM, and at every level of recursion we're branching by a factor of R. Our leaf size is cM; therefore the number of leaves is n/(cM), and for each leaf it takes Θ(M/B) cache misses to load it into cache, after which we can do everything in cache. Multiplying the number of leaves by the cost per leaf, we get Θ(n/B) cache misses for the leaves. Therefore the number of cache misses is n/B times the number of levels, and the number of levels is log base R of n/M. So compared to the binary merge sort algorithm, here we actually have a factor of R in the base of the log, whereas before the base of the log was 2. So now the question is what we're going to set R to be. Again, we have a voodoo parameter — this is not a cache-oblivious algorithm — and we said that R has to be at most cM/B for the analysis to work out, so let's just make it as large as possible: set R = Θ(M/B). Now we can see what the complexity works out to be. We have the tall-cache assumption, and we also have the fact that log base M of n/M is equal to Θ(log n / log M).
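The R-way recurrence and its solution, written out, with R = Θ(M/B) substituted in at the end:

```latex
Q(n) =
\begin{cases}
\Theta(n/B) & \text{if } n \le cM,\\[4pt]
R\,Q(n/R) + \Theta(n/B) & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
Q(n) = \Theta\!\left(\frac{n}{B}\,\log_R \frac{n}{M}\right)
\;\overset{R=\Theta(M/B)}{=}\;
\Theta\!\left(\frac{n}{B}\,\log_{M/B} \frac{n}{M}\right).
```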
So the cache complexity is Θ((n/B) · log base M/B of n/M); but with the tall-cache assumption, log base M/B is the same as log base M asymptotically — that's how we get to the second line — and then you can rearrange and cancel some terms, and we end up with Θ(n log n / (B log M)). So we're saving a factor of Θ(B log M) compared to the work of the algorithm, whereas for the binary version of merge sort we were only saving a factor of B for large enough inputs — here we get another factor of log M in our savings. So as I said, the binary one's cache misses are n log n over B, whereas the multiway one's are n log n over B log M, and as long as n is much greater than M, we're saving much more than the binary version — a factor of log M in cache misses. Let's just ignore the constants and look at what log M can be in practice. Here are some typical cache sizes: the L1 cache size is 32 kilobytes, which is 2 to the 15th, and log base 2 of that is 15, so we get a 15x savings; and for the larger cache sizes we get an even larger savings. Any questions on this? So the problem with this algorithm is that it's not cache-oblivious: we have to tune the value of R for a particular machine, and even when we're running on the same machine, there could be other jobs running that contend for the cache. It turns out that there are several cache-oblivious sorting algorithms. The first one was developed in a paper coauthored by Charles Leiserson, and it's called funnel sort. The idea is to recursively sort n^(1/3) groups of n^(2/3) elements, and then merge the sorted groups with an n^(1/3)-funnel. This funnel is called a K-funnel more generally, and it merges together K^3 elements from K sorted lists; the cost for doing this merge is as shown here, and if you plug this into the recurrence, you'll get that the asymptotic number of cache misses is the same as that of the multiway merge sort algorithm, while being cache-oblivious — and this bound is actually optimal.
So I'm not going to have time to talk about the details of the funnel sort algorithm, but I do have time to show you a pretty picture of what the funnel looks like. This is what a K-funnel looks like. It's recursively constructed: we have a bunch of square-root-of-K funnels inside, all connected to some buffers — they feed elements into the buffers, and the buffers feed elements into this output square-root-of-K funnel, whose output becomes the output of the K-funnel. So this whole blue thing is the K-funnel, and the small green triangles are square-root-of-K funnels. The number of cache misses incurred by doing this merge is shown here, and the analysis uses the tall-cache assumption. So it's a pretty cool algorithm, and we posted a paper online that describes it — if you're interested in the details, I encourage you to read that paper. There are also many other cache-oblivious algorithms out there — there have been hundreds of papers on cache-oblivious algorithms, and here are some of them; in fact, I think all of these are described in the paper we posted. Some cool cache-oblivious data structures have also been developed, such as for B-trees, ordered-file maintenance, and priority queues. So there's a lot of literature on cache-oblivious algorithms, and there's also a lot of material online if you're interested in learning more.
One commonly used method is known as finite difference approximation: we're going to approximate the partial derivative of u with respect to each of its coordinates. The partial derivative of u with respect to t is approximately equal to u(t + Δt, x) — where Δt is some small value — minus u(t, x), all divided by Δt. How many of you have seen this approximation method before, from your calculus class? Okay, good. So as you bring the value of Δt down to zero, the
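Written out, the forward-difference approximation just described is

```latex
\frac{\partial u}{\partial t}(t, x) \;\approx\; \frac{u(t + \Delta t,\, x) - u(t,\, x)}{\Delta t}.
```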
thing on the right-hand side approaches the true partial derivative. So that's the partial derivative with respect to t; we also need the partial derivative with respect to x. Here I'm saying that the partial derivative with respect to x is approximately equal to u(t, x + Δx/2) minus u(t, x − Δx/2), all divided by Δx. Notice that instead of adding Δx in the first term and not adding anything in the second term, I'm actually adding Δx/2 in the first term and subtracting Δx/2 in the second term. It turns out that I can do this with the approximation method and it's still valid, as long as the difference between the two arguments is Δx — here the difference is Δx, and I can basically decide how to split up this Δx between the two things in the numerator. The reason I chose Δx/2 here is that the math is going to work out nicely and give us cleaner code. This is just the first partial derivative with respect to x; I actually need the second partial derivative, since the right-hand side of the heat equation has a second partial derivative. So this is what the second partial derivative looks like: I just take the partial derivative with respect to x of each of the terms in the numerator from the equation above, and then I can plug in the value of this partial derivative by applying the equation above with the arguments t and x + Δx/2, and similarly for the second term. So for the first term, when I plug it into the equation for the partial derivative with respect to x, I'm going to get u(t, x + Δx) minus u(t, x); and for the second term, I'm going to get u(t, x − Δx), and I subtract another factor of u(t, x) — so that's why I'm subtracting 2 times u(t, x) in the numerator. And the partial derivative of each of the things in the numerator also has to be divided by Δx, so
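Written out, the centered first difference and the second difference it yields are

```latex
\frac{\partial u}{\partial x}(t, x) \;\approx\;
\frac{u\!\left(t,\, x + \tfrac{\Delta x}{2}\right) - u\!\left(t,\, x - \tfrac{\Delta x}{2}\right)}{\Delta x},
\qquad
\frac{\partial^2 u}{\partial x^2}(t, x) \;\approx\;
\frac{u(t,\, x + \Delta x) - 2\,u(t,\, x) + u(t,\, x - \Delta x)}{\Delta x^2}.
```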
on the denominator I get Delta x squared so now I have the second partial derivative with respect to X and I also have the first partial derivative with respect to T so I can just plug them into my equation above so on the left hand side I just have this term here and I'm multiplying by this alpha constant and then on this term just comes from here so this is what the 1d heat equation reduces to using the finite difference approximation method so any questions on this okay so how do we actually write code to simulate this equation here so we're gonna use what's called a stencil computation and here I'm going to set Delta X and delta T equal to one just for simplicity but in general you can set them to whatever you want you can make them smaller to have a more fine-grained simulation so my set Delta X and delta T equal to 1 then the denominators of these two terms just become one and I don't need to worry about them and then I'm going to represent my 2d space using a 2d matrix where the horizontal axis represents values of X and the vertical axis represents values of T and I want to fill in all these entries that have a black dot in it the ones with the orange dot those are my boundaries so those are actually fixed throughout the computation so I'm not gonna do any computation on those those are just given to me as input so they's they could be heat sources if we're doing the heat simulation and then now I can actually write code to simulate simulate this equation so if I want to compute U of T plus 1 X I can just go up here and I see that that's equal to this thing over here and then I bring the negative UT X term to the right so I get UT of X plus alpha times u T X plus 1 minus 2 times u T X plus u T X minus 1 you know as I said before we just want to keep iterating this until the temperatures becomes stable so I'm going to proceed in time which in time is going up here and to compute one of these points so let's say this is UT plus 1x I need to know the value of 
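As a concrete sketch, here's that update rule in C with Δx = Δt = 1. The helper name `stencil_point`, the array name, and the ALPHA value are my illustrative assumptions, not the lecture's actual code:

```c
#include <assert.h>

// Illustrative diffusivity constant; any 0 < ALPHA <= 0.5 keeps this
// explicit 1D scheme stable when dx = dt = 1.
#define ALPHA 0.25

// Compute u(t+1, x) from the three points u(t, x-1), u(t, x), u(t, x+1).
// ut points at row t; x must be an interior index (not a boundary cell).
double stencil_point(const double *ut, int x) {
    return ut[x] + ALPHA * (ut[x + 1] - 2.0 * ut[x] + ut[x - 1]);
}
```

A single hot point flanked by cold neighbors loses heat symmetrically to both sides, which is exactly what the three-point formula encodes.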
To compute one of these points — call it u(t + 1, x) — I need u(t, x), which is just the point below it in the matrix, and I also need u(t, x + 1) and u(t, x − 1), which are below it and diagonal to either side. So each value depends on three other values, and this is called a three-point stencil. In general, a stencil computation updates each point in an array using a fixed pattern, called a stencil. I do the same thing for all the other points: I compute all the values of x for a given time step, then move on to the next time step, and keep going until the distribution of temperatures becomes stable, and then I'm done.

Stencil computations are widely used in scientific computing — weather simulations, stock market simulations, fluid dynamics, image processing, probability, and so on — so this is a very important concept to know.

Let's say I ran the code as in the animation: I completed one row at a time before moving on to the next row. How would this code perform with respect to caching? Yes? Right — if n is small, this does pretty well. But what if n is much larger than the size of your cache? You'd do badly. And why is that? There could be some reuse here, because when I compute the second row I use values computed in the first row. But if the row is much larger than the cache, by the time I get to the second row, the values loaded into cache from the first row will have been evicted, and I suffer a cache miss again for the values I need to load — even though they could have been reused if the code had good locality.

Another question: if we only care about the values of x at the most recent time step, do we have to keep this whole 2D matrix around, or can we get by with less storage? How many rows would I have to keep? Two — because I need the row from the previous time step to compute the values in the current time step. After the current time step, I can swap the roles of the two rows and reuse the previous row for the next step. Yes? When I'm computing the second row, I need to keep around all the values computed in the first row, because those values get fed into computations in the second row. Oh, I see what you're saying — that's a good observation: you actually only need a constant amount of extra storage, because you can overwrite values as you go through the row. If you keep around one row, some of its values are from the current time step and some from the previous one. Good observation.

OK, so that code wasn't very cache efficient. You could make it cache efficient using tiling, but we're going to go straight to the cache-oblivious algorithm because it's much cleaner. Recall the ideal cache model we talked about in the previous lecture: a two-level hierarchy with a cache and main memory. The cache has size M bytes and a cache line is B bytes, so you can keep M/B cache lines in the cache. If something is in cache and you operate on it, it incurs no cache misses; if you have to go to main memory to load a cache line, you incur one cache miss. The ideal cache model assumes the cache is fully associative — any cache line can go anywhere in the cache — and it assumes either an optimal omniscient replacement policy or the
LRU policy. The optimal omniscient replacement policy knows the sequence of all future requests to memory, and when it needs to evict something, it picks the thing that leads to the fewest cache misses overall. The LRU policy just evicts the thing that was least recently used. But as we saw in the previous lecture, in terms of asymptotic cost these two replacement policies give you cache miss counts within a constant factor of each other, so you can use whichever is convenient.

The two performance measures we care about when analyzing an algorithm in the ideal cache model are the work and the number of cache misses. The work is just the total number of operations the algorithm does — serially, this is the ordinary running time. The number of cache misses is the number of cache lines you have to transfer between the cache and main memory.

Suppose we're analyzing an algorithm in the ideal cache model and it runs serially. What kind of cache miss does this model not capture? Remember, we talked about several types, and there's one type this model doesn't capture when running serially — assume there's no parallelism here, since sharing misses only come about with parallelism. Yes — conflict misses. And why doesn't the model capture them? Because this is a fully associative cache: any cache line can go anywhere. You only get conflict misses with set-associative schemes, where each cache line maps to a particular set; if too many cache lines map to the same set, they keep evicting each other even though the rest of the cache may have space. That's a conflict miss. The ideal cache model does capture capacity misses, so it's still a very good model to use at a high level when designing efficient algorithms — it encourages you to optimize for spatial and temporal locality. Once you have a good algorithm in the ideal cache model, you can start dealing with conflict misses using some of the strategies we talked about last time, such as padding or temporary memory. Any questions?

OK, so here's the code that does the heat simulation we saw earlier. It's just a doubly nested loop: in the outer loop we iterate over the time dimension, in the inner loop over the space dimension, so we compute all the values of x before moving on to the next time step. We store only two rows, using what's called the even-odd trick: the next row to compute is indexed by (t + 1) mod 2, and the current row by t mod 2, so the two rows implicitly swap roles as we progress through time. We set u[(t + 1) mod 2][x] equal to kernel applied to a pointer to u[t mod 2][x]. The kernel function is defined up top — recall that we pass it a pointer, and a pointer can be treated as the beginning of an array, so we use array notation inside kernel: it returns W[0], the element at the pointer we passed in, plus alpha times (W[−1] − 2·W[0] + W[1]), where W[−1] is the element just before the one we're pointing to and W[1] is the one just after.

OK, let's look at the caching behavior of this code. We'll analyze the cache complexity assuming the LRU replacement policy — because we can — and as before, we loop through one entire row before going on to the next row.
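Here's a self-contained sketch of that looping code with the even-odd trick — my reconstruction, where the array size N, the ALPHA value, and the boundary handling are illustrative assumptions rather than the lecture's exact source:

```c
#include <assert.h>

#define N 8          // number of spatial points (illustrative)
#define ALPHA 0.25   // illustrative diffusivity constant

// Two rows indexed by t mod 2; u[.][0] and u[.][N-1] are fixed boundaries.
static double u[2][N];

// Run T time steps of the 1D heat stencil; the values for time T
// end up in row u[T % 2].
void heat_loop(int T) {
    for (int t = 0; t < T; ++t) {
        const double *cur = u[t % 2];        // row for time t
        double *next = u[(t + 1) % 2];       // row for time t + 1
        for (int x = 1; x < N - 1; ++x)
            next[x] = cur[x] + ALPHA * (cur[x + 1] - 2.0 * cur[x] + cur[x - 1]);
        next[0] = cur[0];                    // boundary cells stay fixed
        next[N - 1] = cur[N - 1];
    }
}
```

With a heat source pinned at the left boundary and a cold right boundary, repeated steps diffuse heat into the interior, approaching a stable linear profile.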
The number of cache misses, assuming n is greater than M — so the row size is greater than the cache size — is Θ(nT/B). How do we get this bound? How many cache misses per row of this 2D space? Right: Θ(n/B) per row, because we can load B bytes at a time — we benefit from spatial locality there — and we have n elements to compute. But as we said, by the time we get to the next row, the things we need from the previous row have already been evicted from cache, so we incur Θ(n/B) misses for every one of the T rows we compute — Θ(nT/B) total. Any questions on this analysis?

Do you think we can do better than this? OK — one person, two, three. It turns out we can. You could do better with tiling, but I'm not going to do the tiled version; I want to do the cache-oblivious version, which works on trapezoidal regions of the 2D space. Recall that a trapezoid has a top base and a bottom base: here the top base is at t1 and the bottom base at t0, so the height is t1 − t0, and the width of the trapezoid is its width at the midpoint between t1 and t0, i.e., at (t1 + t0)/2. We're going to compute all the points inside the trapezoid satisfying these inequalities: t must be at least t0 and less than t1, and x must be at least x0 + dx0·(t − t0) — where dx0 is the inverse slope of the left edge — and less than x1 + dx1·(t − t0), where dx1 is the inverse slope of the right edge. And dx0 and dx1 must each be −1, 0, or 1.
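Written out, the region a single call works on is:

```latex
\begin{aligned}
\mathrm{Trap}(t_0, t_1, x_0, dx_0, x_1, dx_1)
  &= \bigl\{\, (t, x) \;:\; t_0 \le t < t_1,\;\;
     x_0 + dx_0\,(t - t_0) \;\le\; x \;<\; x_1 + dx_1\,(t - t_0) \,\bigr\},\\[4pt]
dx_0, dx_1 &\in \{-1, 0, 1\}, \qquad
\text{height} = t_1 - t_0, \qquad
\text{width} = \text{width at } t = \tfrac{t_0 + t_1}{2}.
\end{aligned}
```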
An inverse slope of −1 corresponds to a slope of −1; an inverse slope of 1 is a slope of 1; and 0 means a vertical edge. The nice property of this trapezoid is that we can compute everything inside it without looking outside: everything here can be computed independently of any other trapezoids we might be generating. So we're going to come up with a divide-and-conquer approach.

The divide-and-conquer algorithm has a base case: when the height of the trapezoid is 1, we just compute all the values using a simple loop. Any order of computation inside that loop is valid, since we already have all the values along the base of the trapezoid, and the values in the top of the trapezoid can be computed in whatever order — they don't depend on each other. That's the base case. Any questions so far?

Here's one of the recursive cases — it turns out we'll have two different kinds of cuts. The first is called a space cut, which I do when the width of the trapezoid is at least twice the height. That means the trapezoid is too wide, so I cut it with a line of slope −1 through the center of the trapezoid, then traverse the trapezoid on the left side first and, after I'm done with that, the one on the right. Can I switch the order and compute the right side before the left? No. Why not? Because some points in the right trapezoid depend on values from the left trapezoid. For the left trapezoid, every point we want to compute already has all its dependencies once we have the base values; but for the right-hand side, some of the values depend on values in the left trapezoid.
So we can't execute the right trapezoid until we're done with the left one. This is also the reason I cut the trapezoid with a slope of −1 instead of a vertical cut: with a vertical cut, both trapezoids would contain points that depend on the other trapezoid. So that's one of the two cuts — the space cut, used when the trapezoid is too wide.

The other cut is the time cut, with respect to the time dimension, used when the trapezoid is too tall — when the width is less than twice the height. In that case I cut with a horizontal line through the center, traverse the bottom trapezoid first, and after that, the top trapezoid. Again, the top trapezoid depends on some points from the bottom one, so the order can't be switched. Any questions?

OK, let's now look at the C code that implements this recursive divide-and-conquer algorithm. It takes as input t0 and t1 — the coordinates of the bottom and top of the trapezoid — then x0, the left end of the bottom base; dx0, the left inverse slope; x1, the right end of the bottom base; and dx1, the right inverse slope. We first compute the height of the trapezoid, lt = t1 − t0. If the height is 1, we just use a for loop over all the points in that height-1 trapezoid, calling the kernel function we defined before. Otherwise the height is greater than 1, and we check whether to do a space cut or a time cut: the condition inside the if clause checks whether the trapezoid is too wide, and if so we make two recursive calls.
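Here is a sketch of that recursion in C, modeled on Frigo and Strumpen's published trapezoid code, which the lecture's code follows. The kernel, the global arrays, ALPHA, and the exact form of the width test are my reconstruction, not the lecture's verbatim source:

```c
#include <assert.h>

#define N 64         // spatial points (illustrative)
#define ALPHA 0.25   // illustrative diffusivity

// Two rows via the even-odd trick; u[.][0] and u[.][N-1] are fixed boundaries.
static double u[2][N];

// Compute u at time t+1, position x, from the three points at time t.
static void kernel(int t, int x) {
    const double *cur = u[t % 2];
    u[(t + 1) % 2][x] =
        cur[x] + ALPHA * (cur[x + 1] - 2.0 * cur[x] + cur[x - 1]);
}

// Compute all points (t, x) with t0 <= t < t1 and
// x0 + dx0*(t - t0) <= x < x1 + dx1*(t - t0).
void trapezoid(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt == 1) {                        // base case: height-1 trapezoid
        for (int x = x0; x < x1; ++x)
            kernel(t0, x);
    } else if (lt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
            // space cut: too wide, so split with a slope -1 line
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
            trapezoid(t0, t1, x0, dx0, xm, -1);   // left trapezoid first
            trapezoid(t0, t1, xm, -1, x1, dx1);   // then the right one
        } else {
            // time cut: split halfway in time, bottom before top
            int halflt = lt / 2;
            trapezoid(t0, t0 + halflt, x0, dx0, x1, dx1);
            trapezoid(t0 + halflt, t1, x0 + dx0 * halflt, dx0,
                      x1 + dx1 * halflt, dx1);
        }
    }
}
```

Calling `trapezoid(0, T, 1, 0, N - 1, 0)` runs T steps over the interior of a rectangular region (initial slopes 0); the recursion generates the slope-(−1) edges internally.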
Those recursive calls to trapezoid cut it in the middle using the slope of −1 — you can see the −1's in the recursive calls. Otherwise we do a time cut: we cut in the middle, at the value of t equal to t0 + lt/2, and again make two recursive calls to trapezoid.

Even though the recursion only ever generates slopes of −1, this code works if I start out with slopes of 1 or 0 — for example, if I had a rectangular region, the initial slopes would just be 0, and the code still works. But most of the slopes it deals with are −1, because those are the only new slopes it generates. Any questions?

So this code is very concise. It turns out the even-odd trick still works here: you can still keep around just two rows, because you're guaranteed not to overwrite any value until everything that depends on it has been computed. But when you use just two rows, the values in a particular row might not all correspond to the same time step, because we're not finishing an entire row before moving on to the next one — we're partially computing rows.

OK, let's analyze the cache complexity of this algorithm, again using the recursion-tree approach we talked about last time. The code splits itself into two subproblems at every level until it gets to a leaf, and each leaf represents Θ(hw) points, where h is the height and w the width of the leaf trapezoid, with h = Θ(w). We have that property because of the nature of the algorithm: when a trapezoid becomes too wide we make it narrower with a space cut, and when it becomes too tall we do a horizontal cut, so the height and width are guaranteed to be
within a constant factor of each other when we reach the base case. Each leaf in the base case incurs Θ(w/B) cache misses, because we have to load the base of the trapezoid from main memory; once that's in cache, we can compute all the other points in the trapezoid without incurring any more misses. So the cost per leaf is Θ(w/B), and for the analysis we set w = Θ(M) — the point at which the trapezoid fits into cache. The algorithm itself has no knowledge of the parameter M — it keeps dividing and conquering until it hits base cases of size 1 — but for the analysis we use a base case of w = Θ(M).

The number of leaves is Θ(NT/(hw)), since each leaf covers Θ(hw) points. The number of internal nodes is the number of leaves minus one, because it's a tree, but the internal nodes don't contribute substantially to the cache complexity: each does only a constant number of cache misses to set up its two recursive calls, and nothing more expensive. So we just count the misses at the leaves: Θ(NT/(hw)) leaves times Θ(w/B) misses each. The w cancels, leaving NT/(hB), and with h and w both Θ(M), we're left with Θ(NT/(MB)) as the cache complexity.

Yes? Sure — each leaf incurs only Θ(w/B) misses because we need to load the values along the base of the trapezoid, which is w wide. Everything else fits into cache: when we compute the remaining points, all the previous values we need are already in cache, so they incur no further misses. Does that make sense? OK.
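Putting the leaf count and per-leaf cost together:

```latex
Q(N, T) \;=\;
\underbrace{\Theta\!\left(\frac{NT}{hw}\right)}_{\text{leaves}}
\cdot\;
\underbrace{\Theta\!\left(\frac{w}{B}\right)}_{\text{misses per leaf}}
\;=\;
\Theta\!\left(\frac{NT}{hB}\right)
\;=\;
\Theta\!\left(\frac{NT}{MB}\right)
\quad\text{when } h = \Theta(w) = \Theta(M).
```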
That was the analysis for one dimension, but it generalizes to more dimensions: in d dimensions, the cache complexity is Θ(NT/(M^(1/d) · B)). For d = 1 that reduces to NT/(MB); for d = 2 it's NT/(B·√M); and so on. It turns out this bound is also optimal. So compared to the looping code, this code has much better cache complexity — it saves a factor of M, since the looping code's Θ(NT/B) had no M in the denominator.

We're now going to run a simulation comparing how the looping code and the trapezoidal code behave. In this simulation, green points correspond to cache hits and purple points to cache misses. We simulate a fully associative cache with the LRU replacement policy, a cache line size of 4 points, and a cache size of 32 points; a cache hit costs 1 cycle and a cache miss 10 cycles — an order of magnitude slower. We do this on a rectangular region with n = 95 for 87 time steps. Let me pull up the simulation.

On the left-hand side is the looping code; on the right, the trapezoidal code. Let's start. You can see the looping code going over one row at a time, while the trapezoidal code is not — it partially computes one row and then moves on to the next. I can also show the cuts generated by the trapezoidal algorithm — let me remember how to do this — there, those are the cuts. And I can speed this up: you can see the trapezoidal algorithm is incurring far fewer cache misses than the looping code. I'll just make this go faster, and
the trapezoidal code finishes while the looping code is still slowly making its way to the top. OK, it's finally done. Any questions? Yes — in which region? So you get a cache miss on the purple dots: the cache line size is 4, so there's a miss on the first point, then hits on the next three, then another miss on the fifth point and hits on points six, seven, and eight. For the row above it: we're assuming the two arrays already fit in cache, so we don't need to load them from memory. The point above just depends on values we've already computed, and those are already in cache — the base of the trapezoid is in cache, and for the row right above it, we just look at those values. Does that make sense? Any other questions?

Do you want to see this again? I could let it run for the rest of the lecture, but I have more interesting material to talk about, so let's stop after this finishes. As you see, the looping code is still slowly making its way to the top — much slower than the trapezoidal code.

That was one dimension; now let's look at what happens in two dimensions. Here I'm going to show another demo, and this time I'm actually running the code for real — the previous demo was just a simulation, but this happens in real time. I'll simulate heat in a 2D space; recall that the colors correspond to temperatures, so a brighter color means hotter. Let me pull up the demo. Here my mouse cursor is the source of heat: it makes the points around the cursor hot, and the heat slowly diffuses to the other points. I can move the cursor so the heat source changes, and the heat I generated earlier slowly goes away.

In the lower-left-hand corner I'm showing the number of iterations per second. The looping code is doing about 1,560 iterations per second. Let's switch to the trapezoidal code: about 1,830 iterations per second. So it's a little faster, but not by much. Does anyone have an idea why we're seeing this behavior? We said the trapezoidal code incurs many fewer cache misses than the looping code, so we'd expect it to be significantly faster, yet it's only a little faster. Yes — that's a good point: in 2D you only save a factor of √M instead of M, but √M is still pretty big compared to the speedup we're getting here. Any other ideas? Yes — there is a constant factor in the trapezoidal code, but even after accounting for the constant factor, you should still see a better speedup than this. What other problem might be going on? Yes — that's another good observation, but the caches we're using should still have cache complexity pretty close to the ideal cache model — you might be off by a small constant factor, but not by this much. Sorry, could you repeat that? If I move the cursor, it gets a little slower, but that doesn't really affect the performance — I can just leave the cursor still, and this is the iterations-per-second we get. Yes — so there's some other factor dominating. Any ideas what it might be? No, it's not the rendering — I ran the code without showing the graphics, and performance was similar. Yes — there could be other things using the cache, but that would be true for both programs, and I don't have anything cache-intensive running — I don't think PowerPoint uses much of my cache.

All right, let's look at why this is the case. It turns out the hardware is actually helping the looping code. The question was: how can the cache-oblivious trapezoidal code have so many fewer cache misses, yet the advantage it gains over the looping version be so marginal? The answer is hardware prefetching. Prefetching works well for the looping code because its access pattern is very regular — one row at a time — so the hardware can predict its memory access pattern and bring in the cache lines it will need before it actually gets to that part of the computation. Prefetching doesn't help the trapezoidal code nearly as much, because its access pattern is less regular. Prefetching does use memory bandwidth, but on a single core you have more than enough bandwidth to take advantage of the machine's hardware prefetching. Later we'll see that when running in parallel, memory bandwidth becomes more of an issue, with multiple processors all using memory.

Question? Yes — you can do software prefetching; there are instructions for it. Hardware prefetching is usually more efficient but less flexible than software prefetching. Here we're only taking advantage of hardware prefetching — we didn't actually try software prefetching; it might help a little, though I think it would probably benefit the looping code as well. Yes — sorry? Hardware prefetching is enabled; it's always done by
the hardware on the machines we use today. So this was a pretty surprising result, but we'll actually see the effect of memory bandwidth later when we look at the parallel code. Any other questions before I continue?

OK, let's now look at the interplay between caching and parallelism. This is a theorem we proved in the previous lecture; let's recall what it says. Let Q_P be the number of cache misses in a deterministic Cilk computation when run on P processors, each with a private cache, and let S_P be the number of successful steals during the computation. In the ideal cache model with cache size M and block size B, the number of cache misses on P processors equals the number of cache misses on one processor plus O(S_P · M/B). Last time we also said the number of successful steals is upper-bounded by the span of the computation times the number of processors — so if you minimize the span of your computation, you also minimize the number of successful steals — and for low-span algorithms, the Q_1 term usually dominates. I won't go over the proof again; the moral of the story is that minimizing cache misses in the serial elision essentially minimizes them in the parallel execution, assuming a low-span algorithm.

So, does our trapezoidal algorithm work in parallel? Does the space cut work in parallel? Recall that the space cut slices with a slope of −1 through the center when the trapezoid is too wide. Can I execute the two trapezoids in parallel? No — the right trapezoid depends on results from the left trapezoid, so they can't run at the same time. But there is a way to execute trapezoids in parallel: instead of one cut through the center, do a V cut. Now I have three trapezoids, and the two trapezoids labeled 1 — the black ones — can execute in parallel, because they're independent: everything in them depends only on their own bases. After both are done, I compute the trapezoid labeled 2. This is known as a parallel space cut: it produces two black trapezoids and a gray trapezoid; the two black trapezoids execute in parallel, and afterwards the gray one executes, and this is done recursively as well. Any questions? Oh, and if the trapezoid is inverted, we do an upside-down V cut, and in that case we execute the middle trapezoid before the two trapezoids on the sides. For the time cut, it turns out we just use the same cut as before and execute the two trapezoids serially. So we do get a little bit of parallelism from the parallel space cut.

Let's look at how the parallel codes perform. That was the serial looping code; here's the parallel looping code — we had about 1,450 iterations per second before, about 3,700 now, so a little over twice the speedup, and this is on a four-core machine — just my laptop. Yes? It slows down a little, but not by too much. Now the trapezoidal code: as we saw before, it does about 1,840 iterations per second serially, and parallelized it does about 5,350 — roughly a factor of three speedup. I can move the cursor around a bit more if you want to see the serial and parallel trapezoids. Is everyone happy? OK.

Because I had to run this in real time, the input size wasn't very big, so I ran the experiment offline, without the rendering, on a much larger input: a 3,000-by-3,000 grid for 1,000 time steps, using four processor cores, and my cache size is eight megabytes
— the last-level cache size. So the input size is much larger than my last-level cache. Here are the times I got: the serial looping code took about 129 seconds, and in parallel about 66 seconds — about a factor of two speedup, consistent with what we saw in the demo. The trapezoidal code actually got a better speedup once we increased the input size — about a factor of four — and this is because for a larger input, cache efficiency plays a much bigger role, since the cache is so small compared to the input. So the parallel looping code achieves less than half its potential speedup, even though it has much more potential parallelism than the trapezoidal code — the trapezoidal code only gets a little parallelism, from the space cuts — yet the trapezoidal code gets near-linear speedup: 3.96x on four cores.

What could be going on here? Another thing to look at is to compare the serial trapezoidal code with the serial looping code, and the parallel trapezoidal code with the parallel looping code. The serial trapezoidal code is about twice as fast as the serial looping code, but the parallel trapezoidal code is about four times faster than the parallel looping code. The reason is that hardware prefetching can't help the parallel looping code that much: when you run in parallel, all the cores are using memory, there's a memory bandwidth bottleneck, and prefetching itself consumes memory bandwidth. In the serial case we had plenty of bandwidth to spend on prefetching, but in the parallel case we don't. That's why, in the parallel case, the trapezoidal code gets a better speedup over the looping code than it does serially. The trapezoidal code also scales better because it works more locally, so it needs fewer memory operations — there's a scalability bottleneck at the memory interconnect, and because the trapezoidal code is cache oblivious, it does most of its work in cache, whereas the looping code makes more accesses to main memory. Any questions on this?

OK, so when we have a parallel speedup bottleneck, how do we know what's causing it? There are several main things to look at: whether the algorithm has insufficient parallelism, whether scheduling overhead dominates, whether we lack memory bandwidth, or whether there's contention — and contention can mean either locking or true and false sharing, which we talked about last lecture. The first two are usually quite easy to diagnose: you can compute the work and span of your algorithm, and from that the parallelism. You can also use Cilkscale to help diagnose them — Cilkscale tells you how much parallelism your code has, and also the burdened parallelism, which includes scheduler overhead.

What about memory bandwidth — how can we diagnose that? Does anyone have any ideas? I can tell you one way: open up the hardware, take out all the memory chips except one, and run the serial code; if it slows down, the code was probably memory-bandwidth bound when run in parallel. But that's a pretty heavy-handed way to diagnose the problem. Is there anything we can do with just software? Yes — you could use a tool like Valgrind to count the number of memory accesses. Yes — you can make sure only one processor is being used, but it might be using more memory than just the
memory from one chip. There's actually a simpler way to do this: the idea is to run P identical copies of the serial code, so they'll all be executing in parallel, and if the execution slows down, that means they were probably contending for memory bandwidth. Does that make sense? One caveat is that you can only do this if you have enough physical memory, because when you're running P identical copies you need more DRAM than if you just ran one copy. But oftentimes you can isolate some part of the code that you think has the performance bottleneck and just execute that part of the program with P copies in parallel, and hopefully that will take less memory. There are also hardware counters you can check, if you have root access to your machine, that can measure how much memory bandwidth your program is using, but running P copies is a pretty simple way to do it. So there are ways to diagnose a lack of memory bandwidth. It turns out that contention is much harder to diagnose. There are tools that detect lock contention in an execution, but usually they only detect contention when it actually happens, and the contention doesn't have to happen every time you run your code, so these tools don't detect the potential for lock contention. The potential for true and false sharing is even harder to detect, especially false sharing, because if you're using a bunch of variables in your code you don't know which of them map to the same cache line. So this is much harder to detect. Usually, when you're trying to debug a speed-up bottleneck in your code, you would first look at the first three things, and once you've eliminated those, you can start looking at whether contention is causing the problem. Any questions? Okay, so that covers stencil computations.
the cache complexity of a standard merge sort. We first need to analyze the complexity of merging, because it's used as a subroutine in merge sort. Recall that in merging we're given two sorted input arrays, and we want to generate an output array, also sorted, containing all the elements from the two input arrays. The algorithm maintains a pointer to the head of each input array, compares the two head elements, takes the smaller one and puts it into the output, and then increments the pointer for that array. We keep doing this, taking the smaller of the two head elements, until both input arrays are empty, at which point we're done and we have one sorted output array. To merge n elements, the time is Θ(n), where n is the sum of the sizes of the two input arrays, because we do only a constant amount of work per input element. What about the number of cache misses; how many cache misses do I incur when merging n elements? I incur Θ(n/B) cache misses: my two input arrays are stored contiguously in memory, so I can read them B bytes at a time with one cache miss per block, and my output array is also stored contiguously, so I can write it out B bytes at a time with one cache miss per block. I might waste a cache line at the beginning and end of each of my three arrays, but that only affects the bound by a constant factor, so it's Θ(n/B). So now let's look at merge sort. Recall that merge sort makes two recursive calls to itself on inputs of half the size, and then does a merge at the end to combine the two sorted outputs of its recursive calls. If you look at how the recursion proceeds, it first divides the input array into two halves, divides each of those into two halves again, and so on until it reaches the base case of just one element, at which point it returns; then we start merging pairs of these elements together. Now we have arrays of size two in sorted order; we merge pairs of those arrays and get subarrays of size four, and finally we do this one more time to get the sorted output. Okay, so let's review the work of merge sort. What's the recurrence for the work? We have two subproblems of size n/2, and then we need Θ(n) work to do the merge, so W(n) = 2W(n/2) + Θ(n). This is case two of the Master Theorem: log base b of a is log base 2 of 2, which is 1, and that's the same as the exponent in the Θ(n) term we're adding in, so since they're the same we pick up an additional log factor, and the overall work is Θ(n log n). Now I'm going to solve this recurrence again using the recursion-tree method; I'll still get Θ(n log n), but I'm doing this because it's going to be useful when we analyze the cache complexity. At the top level I have a problem of size n, which branches into two problems of size n/2, and when they're done I have to do a merge, which takes Θ(n) work; I'm just writing n here and ignoring the constants. I branch again, and each of those nodes does n/2 work to merge, and I go all the way down to the base case after log base 2 of n levels. The top level does n work, the second level also does n work in total, and it's the same for all levels down to the leaves, which also do a linear amount of work in total. So the overall work is the work per level times the number of levels, which is Θ(n log n). Any questions? Okay, so now let's analyze this with caching. We said earlier that the cache complexity of the merging subroutine is Θ(n/B), and here's the recurrence for the cache complexity of merge sort. My base case is when n is at most cM for some sufficiently small constant c, because at that point the problem size fits in cache.
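As a concrete reference point, the binary merge sort just analyzed can be sketched in C. This is a minimal version with names and an `int` element type of my own choosing, not the lecture's actual code; the caller supplies a scratch buffer `tmp` of at least n elements.

```c
#include <stdlib.h>
#include <string.h>

// Merge two sorted runs a[0..na) and b[0..nb) into out[0..na+nb).
// Each input is scanned once, contiguously, so the merge incurs
// Theta(n/B) cache misses for n = na + nb.
static void merge(const int *a, size_t na, const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

// Classic binary merge sort: W(n) = 2W(n/2) + Theta(n) = Theta(n log n).
void merge_sort(int *a, size_t n, int *tmp) {
    if (n <= 1) return;
    size_t half = n / 2;
    merge_sort(a, half, tmp);
    merge_sort(a + half, n - half, tmp);
    merge(a, half, a + half, n - half, tmp);
    memcpy(a, tmp, n * sizeof(int));      // copy the merged run back
}
```

Note that both recursive calls can reuse the same scratch buffer because each call finishes before its sibling's result is merged.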
Everything else I do for that problem will then also be in cache, and to load the base case into cache I incur Θ(n/B) cache misses. Otherwise, I have two recursive calls of size n/2, and then Θ(n/B) cache misses to do the merge of my two results, so Q(n) = 2Q(n/2) + Θ(n/B). Here my base case is larger than the one I used for the work: the algorithm actually keeps recursing down to a constant-size base case, but just for the analysis I stop the recursion when n is at most cM. So let's analyze this. Again I have a problem of size n at the beginning, which splits into two problems of size n/2, and I pay n/B cache misses to merge the results together. Similarly, at the next level I pay n/(2B) cache misses for each of my two subproblems to do their merges. I keep going down until I get to a subproblem of size Θ(cM); at that point it fits in cache and I don't need to recurse any further in my analysis. So the number of levels of this recursion tree is log base 2 of n/(cM): I'm basically chopping off the bottom of the recursion tree, and the number of levels I would have had below this point is log base 2 of cM, so I take log base 2 of n, subtract log base 2 of cM, and that's equivalent to log base 2 of n/(cM). The number of leaves is n/(cM), since each leaf is Θ(cM) large, and the number of cache misses I incur to process a leaf is Θ(M/B), because I just need to load the input for that subproblem into cache and then everything else fits in cache. So the top level incurs n/B cache misses, the next level also incurs n/B cache misses in total, same for the third level, and for the leaves I incur Θ(M/B) times n/(cM) cache misses, where n/(cM) is the number of leaves and Θ(M/B) is the cost per leaf; that product also equals Θ(n/B). So overall I multiply Θ(n/B) by the number of levels, log base 2 of n/(cM), and dropping the constant, since it doesn't affect the asymptotic bound, the number of cache misses is Θ((n/B) log(n/M)), in any base for the log. Any questions on this analysis? Okay, so I'm saving a factor of B here in the first term, which gives me a better cache complexity than just the work bound, but the M term is inside the log, and that doesn't actually help me much. Let's look at how much we actually save. When n is much greater than M, log base 2 of n/M is approximately log base 2 of n, so the M term doesn't contribute much to the asymptotic cost, and compared to the work we're only saving a factor of B. When n is approximately equal to M, log base 2 of n/M is constant, and we're saving a factor of B log n. So we save more when the problem size is small relative to the memory size, but for large enough problem sizes we're only saving a factor of B. Does anyone think we can do better than this? I've asked this question several times before, and the answer is always the same: yes. We're going to do this using multi-way merging. Instead of merging just two sorted subarrays, we're going to merge R sorted subarrays, each of size n/R, using what's called a tournament tree. Here's how the tournament tree works: I compare the heads of each pair of these subarrays and store the winner in a node of the tournament tree, then I compare pairs of those nodes and take the minimum of each pair, and so on up the tree; eventually, after all the comparisons, I'm left with the minimum element at the root of the tournament tree, and I can place it into the output.
The first time I fill in this tournament tree, it takes Θ(R) work, because there are Θ(R) nodes in the tree. So in the example, when I compare these two elements the smaller one is 6; for these two, the smaller one is 2; then I compare 2 and 6 and take the smaller one; on the other side I have a 7, so I compare 2 and 7, 2 is smaller, and it appears at the root. Then I know that's the minimum element among the heads of all the R subarrays I'm merging. So the first round, generating this tournament tree, takes Θ(R) work because I have to fill in Θ(R) nodes, but once I've generated it, for all subsequent rounds I only need to fill in the path from the element that won to the root of the tournament tree, because those are the only values that could have changed. Now I fill in this path, and it takes only Θ(log R) work, because the height of the tournament tree is log base 2 of R. So I fill this in: 14 goes here, 6 is the smaller of the two, and 6 again wins at the root, so my next output element is 6. I keep doing this until all the elements from the R subarrays have been placed into the output. The total work for merging is Θ(R) for the first round plus Θ(n log R) for all the remaining rounds, and that's just Θ(n log R), because we're assuming n is at least R here. Any questions on how the multi-way merge works? No? Okay, so let's analyze the work of this multi-way merge when it's used inside merge sort. The recurrence is as follows: if n is 1, we just do a constant amount of work; otherwise we have R subproblems of size n/R, and then we pay Θ(n log R) to do the multi-way merge. Here's the recursion tree. At the top level we have a problem of size n, we split it into R subproblems of size n/R, and we have to pay n log R work to merge the results of the recursive calls together, and we keep doing this. It turns out that the work at each level sums to Θ(n log base 2 of R), and the number of levels is log base R of n, because each time we're branching by a factor of R. For the leaves, we have n leaves and we just pay linear work in total for them, because we don't have to do a merge there; we're not making any more recursive calls. So the overall work is Θ(n log R) times log base R of n, plus n for the leaves, which is a lower-order term, and if you work out the math some of the terms cancel and you get Θ(n log n) for the work. So the work is the same as for binary merge sort. Let's now analyze the cache complexity. Assume that R < cM/B for a sufficiently small constant c at most 1, and consider the R-way merging of contiguous arrays of total size n. If R < cM/B, then we can fit the entire tournament tree into cache, and we can also fit one block from each of the R subarrays into cache, and in that case the total number of cache misses to do the multi-way merge is just Θ(n/B), because we just have to scan over the n elements in the input arrays. So the recurrence for the R-way merge sort is as follows: if n < cM, the problem fits in cache, and the number of cache misses we pay is Θ(n/B); otherwise we have R subproblems of size n/R, and we add Θ(n/B) to merge the results of the subproblems. Any questions on the recurrence? Yes, good question: we didn't pick the value of R yet, so this is not a cache-oblivious algorithm; we'll see what to choose for R in a couple of slides.
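To make the tournament-tree mechanics concrete, here's a sketch of an R-way merge built on a simple "winner" tree. The names, the `int` element type, and the power-of-two padding of the tree are my own choices, not from the lecture; exhausted runs are treated as having head +infinity.

```c
#include <limits.h>
#include <stdlib.h>

// One sorted input run: a[pos..len) is the part not yet consumed.
typedef struct { const int *a; size_t len, pos; } Run;

// Head of a run, or INT_MAX (acting as +infinity) once it's exhausted.
static int head(const Run *r) { return r->pos < r->len ? r->a[r->pos] : INT_MAX; }

// Merge r sorted runs (n elements total) into out using a winner tree:
// tree[i] holds the index of the run winning at internal node i, and the
// leaves live at tree[R..2R). The initial fill is Theta(R) work; each
// later extraction replays one leaf-to-root path, Theta(log R) work.
void multiway_merge(Run *runs, size_t r, int *out, size_t n) {
    size_t R = 1;
    while (R < r) R <<= 1;                 // pad run count to a power of two
    Run dummy = { NULL, 0, 0 };            // padding runs are always empty
    size_t *tree = malloc(2 * R * sizeof *tree);
    for (size_t i = 0; i < R; i++)
        tree[R + i] = i;                   // leaf i represents run i
    #define HEAD(idx) ((idx) < r ? head(&runs[idx]) : head(&dummy))
    for (size_t i = R - 1; i >= 1; i--)    // initial fill, bottom-up
        tree[i] = HEAD(tree[2*i]) <= HEAD(tree[2*i+1]) ? tree[2*i] : tree[2*i+1];
    for (size_t k = 0; k < n; k++) {
        size_t win = tree[1];              // run whose head is the minimum
        out[k] = runs[win].a[runs[win].pos++];
        for (size_t i = (R + win) / 2; i >= 1; i /= 2)  // replay the path
            tree[i] = HEAD(tree[2*i]) <= HEAD(tree[2*i+1]) ? tree[2*i] : tree[2*i+1];
    }
    #undef HEAD
    free(tree);
}
```

The cache-friendliness in the analysis comes from the fact that when R < cM/B, the tree and one block per run all fit in cache at once, so the merge streams over the input with Θ(n/B) misses.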
So let's analyze the cache complexity of this algorithm, again using the recursion-tree method. At the top level we have R subproblems of size n/R, and we pay n/B cache misses to merge them, and it turns out that at every level the number of cache misses we pay is n/B, if you work out the math. The number of levels of this recursion tree is log base R of n/(cM), because we stop recursing when the problem size reaches cM, and at every level of recursion we're branching by a factor of R. The leaf size is cM, so the number of leaves is n/(cM), and each leaf takes Θ(M/B) cache misses to load into cache, after which we can do everything in cache; multiplying the number of leaves by the cost per leaf gives Θ(n/B) cache misses for the leaves. Therefore the total number of cache misses is n/B times the number of levels, which is log base R of n/M. Compared to the binary merge sort, we now have a factor of R in the base of the log, where before the base of the log was 2. So now the question is what to set R to. Again we have a voodoo parameter, so this is not a cache-oblivious algorithm, and we said R has to be at most cM/B for the analysis to work out, so let's just make it as large as possible: set R = Θ(M/B). Now we can see what the complexity works out to be. We have the tall-cache assumption, and we also have the fact that log base M of n/M is Θ(log n / log M). The cache complexity is Θ((n/B) log base M/B of n/M), but with the tall-cache assumption, log base M/B is the same as log base M asymptotically; that's how we get to the second line. Then you can rearrange and cancel some terms, and you end up with Θ(n log n / (B log M)). So we're saving a factor of Θ(B log M) compared to the work of the algorithm, whereas for the binary version of merge sort we were only saving a factor of B for large enough inputs. Here we get another factor of log M in our savings. As I said, the binary version incurs about n log n over B cache misses, whereas the multi-way version incurs n log n over B log M, so as long as n is much greater than M we're saving a factor of log M in cache misses. Let's ignore the constants and look at what log M can be in practice. Here are some typical cache sizes: the L1 cache size is 32 kilobytes, which is 2 to the 15th, and log base 2 of that is 15, so we get a 15x savings, and for the larger cache sizes we get even larger savings. Any questions on this? So the problem with this algorithm is that it's not cache-oblivious: we have to tune the value of R for a particular machine, and even when we're running on the same machine there could be other jobs running that contend for the cache. It turns out there are several cache-oblivious sorting algorithms. The first one was developed in a paper coauthored by Charles Leiserson, and it's called funnel sort. The idea is to recursively sort n to the 1/3 groups of n to the 2/3 elements each, and then merge the sorted groups with an n-to-the-1/3 funnel. This funnel is called a K-funnel more generally, and it merges together K sorted lists containing K cubed elements in total. The cost for doing this merge is as shown here, and if you plug it into the recurrence you'll find that the asymptotic number of cache misses is the same as that of the multi-way merge sort algorithm, while being cache-oblivious, and this bound is actually optimal. I'm not going to have time to talk about the details of the funnel sort algorithm, but I do have time to show you a pretty picture of what the funnel looks like. This is what a K-funnel looks like: it's recursively constructed, with a bunch of square-root-of-K funnels inside, all connected to buffers, so the square-root-of-K funnels feed elements into the buffers, and then the buffers feed
elements into the output square-root-of-K funnel, which produces the output of the whole K-funnel. So this whole blue triangle is the K-funnel, and the small green triangles are square-root-of-K funnels. The number of cache misses incurred by doing this merge is shown here, and the analysis uses the tall-cache assumption. It's a pretty cool algorithm, and we posted a paper online that describes it, so if you're interested in the details I encourage you to read that paper. There are also many other cache-oblivious algorithms out there; there have been hundreds of papers on cache-oblivious algorithms, and here are some of them. Some of these are also described in the paper; in fact, I think all of these are described in the paper we posted. Some cool cache-oblivious data structures have been developed as well, such as for B-trees, ordered-file maintenance, and priority queues. So there's a lot of literature on cache-oblivious algorithms, and there's also a lot of material online if you're interested in learning more.
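To summarize the sorting bounds from this part of the lecture in one place (with cache size M, line size B, and the tall-cache assumption M = Ω(B²); the recurrences restate the ones analyzed above):

```latex
% Binary merge sort: Q(n) = 2\,Q(n/2) + \Theta(n/B), with Q(n) = \Theta(n/B) for n \le cM
Q_{\mathrm{binary}}(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_2\frac{n}{M}\right)

% R-way merge sort with R = \Theta(M/B): Q(n) = R\,Q(n/R) + \Theta(n/B)
Q_{\mathrm{multiway}}(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{M}\right)
\;=\; \Theta\!\left(\frac{n \log n}{B \log M}\right)
```

Funnel sort matches the multi-way bound while being cache-oblivious, and that bound is optimal for sorting in this model.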
00:24:57,140 --> 00:24:59,030 504 00:24:59,030 --> 00:25:01,700 505 00:25:01,700 --> 00:25:02,990 506 00:25:02,990 --> 00:25:04,910 507 00:25:04,910 --> 00:25:07,130 508 00:25:07,130 --> 00:25:08,870 509 00:25:08,870 --> 00:25:14,900 510 00:25:14,900 --> 00:25:17,600 511 00:25:17,600 --> 00:25:20,240 512 00:25:20,240 --> 00:25:22,190 513 00:25:22,190 --> 00:25:25,130 514 00:25:25,130 --> 00:25:28,580 515 00:25:28,580 --> 00:25:32,060 516 00:25:32,060 --> 00:25:34,700 517 00:25:34,700 --> 00:25:38,540 518 00:25:38,540 --> 00:25:41,240 519 00:25:41,240 --> 00:25:43,130 520 00:25:43,130 --> 00:25:44,720 521 00:25:44,720 --> 00:25:49,070 522 00:25:49,070 --> 00:25:56,330 523 00:25:56,330 --> 00:25:58,940 524 00:25:58,940 --> 00:25:59,840 525 00:25:59,840 --> 00:26:02,150 526 00:26:02,150 --> 00:26:05,960 527 00:26:05,960 --> 00:26:08,390 528 00:26:08,390 --> 00:26:09,830 529 00:26:09,830 --> 00:26:11,720 530 00:26:11,720 --> 00:26:15,280 531 00:26:15,280 --> 00:26:20,330 532 00:26:20,330 --> 00:26:22,370 533 00:26:22,370 --> 00:26:24,890 534 00:26:24,890 --> 00:26:28,340 535 00:26:28,340 --> 00:26:30,740 536 00:26:30,740 --> 00:26:32,600 537 00:26:32,600 --> 00:26:35,000 538 00:26:35,000 --> 00:26:39,920 539 00:26:39,920 --> 00:26:41,840 540 00:26:41,840 --> 00:26:43,430 541 00:26:43,430 --> 00:26:52,529 542 00:26:52,529 --> 00:26:54,959 543 00:26:54,959 --> 00:26:56,729 544 00:26:56,729 --> 00:27:01,229 545 00:27:01,229 --> 00:27:03,599 546 00:27:03,599 --> 00:27:05,339 547 00:27:05,339 --> 00:27:07,159 548 00:27:07,159 --> 00:27:10,709 549 00:27:10,709 --> 00:27:12,389 550 00:27:12,389 --> 00:27:14,999 551 00:27:14,999 --> 00:27:16,979 552 00:27:16,979 --> 00:27:18,719 553 00:27:18,719 --> 00:27:21,509 554 00:27:21,509 --> 00:27:23,609 555 00:27:23,609 --> 00:27:25,559 556 00:27:25,559 --> 00:27:27,089 557 00:27:27,089 --> 00:27:29,129 558 00:27:29,129 --> 00:27:30,930 559 00:27:30,930 --> 00:27:35,399 560 00:27:35,399 --> 00:27:37,079 561 00:27:37,079 --> 
00:27:40,229 562 00:27:40,229 --> 00:27:42,209 563 00:27:42,209 --> 00:27:44,519 564 00:27:44,519 --> 00:27:46,439 565 00:27:46,439 --> 00:27:49,169 566 00:27:49,169 --> 00:27:51,569 567 00:27:51,569 --> 00:27:53,579 568 00:27:53,579 --> 00:27:55,889 569 00:27:55,889 --> 00:27:58,709 570 00:27:58,709 --> 00:28:00,989 571 00:28:00,989 --> 00:28:02,399 572 00:28:02,399 --> 00:28:04,229 573 00:28:04,229 --> 00:28:05,909 574 00:28:05,909 --> 00:28:07,589 575 00:28:07,589 --> 00:28:16,510 576 00:28:16,510 --> 00:28:19,730 577 00:28:19,730 --> 00:28:22,010 578 00:28:22,010 --> 00:28:25,780 579 00:28:25,780 --> 00:28:30,290 580 00:28:30,290 --> 00:28:32,030 581 00:28:32,030 --> 00:28:33,920 582 00:28:33,920 --> 00:28:37,580 583 00:28:37,580 --> 00:28:39,410 584 00:28:39,410 --> 00:28:42,350 585 00:28:42,350 --> 00:28:44,990 586 00:28:44,990 --> 00:28:47,390 587 00:28:47,390 --> 00:28:50,930 588 00:28:50,930 --> 00:28:52,370 589 00:28:52,370 --> 00:28:54,920 590 00:28:54,920 --> 00:28:58,190 591 00:28:58,190 --> 00:29:00,830 592 00:29:00,830 --> 00:29:05,270 593 00:29:05,270 --> 00:29:08,240 594 00:29:08,240 --> 00:29:09,620 595 00:29:09,620 --> 00:29:13,130 596 00:29:13,130 --> 00:29:15,770 597 00:29:15,770 --> 00:29:17,540 598 00:29:17,540 --> 00:29:20,090 599 00:29:20,090 --> 00:29:22,520 600 00:29:22,520 --> 00:29:24,440 601 00:29:24,440 --> 00:29:26,210 602 00:29:26,210 --> 00:29:29,750 603 00:29:29,750 --> 00:29:32,170 604 00:29:32,170 --> 00:29:34,550 605 00:29:34,550 --> 00:29:36,740 606 00:29:36,740 --> 00:29:38,690 607 00:29:38,690 --> 00:29:41,660 608 00:29:41,660 --> 00:29:43,970 609 00:29:43,970 --> 00:29:46,790 610 00:29:46,790 --> 00:29:51,020 611 00:29:51,020 --> 00:29:53,870 612 00:29:53,870 --> 00:30:00,200 613 00:30:00,200 --> 00:30:02,480 614 00:30:02,480 --> 00:30:04,160 615 00:30:04,160 --> 00:30:07,130 616 00:30:07,130 --> 00:30:09,170 617 00:30:09,170 --> 00:30:11,780 618 00:30:11,780 --> 00:30:13,940 619 00:30:13,940 --> 00:30:15,950 620 
00:30:15,950 --> 00:30:17,540 621 00:30:17,540 --> 00:30:19,010 622 00:30:19,010 --> 00:30:20,780 623 00:30:20,780 --> 00:30:24,620 624 00:30:24,620 --> 00:30:33,810 625 00:30:33,810 --> 00:30:37,090 626 00:30:37,090 --> 00:30:40,780 627 00:30:40,780 --> 00:30:44,170 628 00:30:44,170 --> 00:30:46,030 629 00:30:46,030 --> 00:30:48,220 630 00:30:48,220 --> 00:30:51,640 631 00:30:51,640 --> 00:30:55,180 632 00:30:55,180 --> 00:30:57,280 633 00:30:57,280 --> 00:30:59,470 634 00:30:59,470 --> 00:31:01,270 635 00:31:01,270 --> 00:31:03,190 636 00:31:03,190 --> 00:31:04,750 637 00:31:04,750 --> 00:31:10,020 638 00:31:10,020 --> 00:31:12,760 639 00:31:12,760 --> 00:31:18,070 640 00:31:18,070 --> 00:31:19,630 641 00:31:19,630 --> 00:31:24,030 642 00:31:24,030 --> 00:31:28,180 643 00:31:28,180 --> 00:31:30,520 644 00:31:30,520 --> 00:31:34,290 645 00:31:34,290 --> 00:31:39,580 646 00:31:39,580 --> 00:31:41,290 647 00:31:41,290 --> 00:31:45,370 648 00:31:45,370 --> 00:31:47,890 649 00:31:47,890 --> 00:31:49,840 650 00:31:49,840 --> 00:31:51,610 651 00:31:51,610 --> 00:31:55,180 652 00:31:55,180 --> 00:31:57,220 653 00:31:57,220 --> 00:31:59,260 654 00:31:59,260 --> 00:32:00,820 655 00:32:00,820 --> 00:32:02,350 656 00:32:02,350 --> 00:32:06,610 657 00:32:06,610 --> 00:32:08,920 658 00:32:08,920 --> 00:32:12,340 659 00:32:12,340 --> 00:32:14,380 660 00:32:14,380 --> 00:32:17,080 661 00:32:17,080 --> 00:32:19,870 662 00:32:19,870 --> 00:32:21,490 663 00:32:21,490 --> 00:32:23,290 664 00:32:23,290 --> 00:32:25,720 665 00:32:25,720 --> 00:32:30,010 666 00:32:30,010 --> 00:32:32,230 667 00:32:32,230 --> 00:32:35,200 668 00:32:35,200 --> 00:32:37,720 669 00:32:37,720 --> 00:32:40,900 670 00:32:40,900 --> 00:32:42,820 671 00:32:42,820 --> 00:32:45,130 672 00:32:45,130 --> 00:32:47,710 673 00:32:47,710 --> 00:32:49,660 674 00:32:49,660 --> 00:32:57,070 675 00:32:57,070 --> 00:32:59,260 676 00:32:59,260 --> 00:33:03,610 677 00:33:03,610 --> 00:33:09,430 678 00:33:09,430 --> 
00:33:11,320 679 00:33:11,320 --> 00:33:13,240 680 00:33:13,240 --> 00:33:15,970 681 00:33:15,970 --> 00:33:17,980 682 00:33:17,980 --> 00:33:20,560 683 00:33:20,560 --> 00:33:22,510 684 00:33:22,510 --> 00:33:24,370 685 00:33:24,370 --> 00:33:26,560 686 00:33:26,560 --> 00:33:30,820 687 00:33:30,820 --> 00:33:33,760 688 00:33:33,760 --> 00:33:37,210 689 00:33:37,210 --> 00:33:39,520 690 00:33:39,520 --> 00:33:41,950 691 00:33:41,950 --> 00:33:44,230 692 00:33:44,230 --> 00:33:49,360 693 00:33:49,360 --> 00:33:52,570 694 00:33:52,570 --> 00:33:56,320 695 00:33:56,320 --> 00:33:56,330 696 00:33:56,330 --> 00:34:02,280 697 00:34:02,280 --> 00:34:06,000 698 00:34:06,000 --> 00:34:09,240 699 00:34:09,240 --> 00:34:11,700 700 00:34:11,700 --> 00:34:13,950 701 00:34:13,950 --> 00:34:16,110 702 00:34:16,110 --> 00:34:18,629 703 00:34:18,629 --> 00:34:20,790 704 00:34:20,790 --> 00:34:22,409 705 00:34:22,409 --> 00:34:24,210 706 00:34:24,210 --> 00:34:27,240 707 00:34:27,240 --> 00:34:29,669 708 00:34:29,669 --> 00:34:42,300 709 00:34:42,300 --> 00:34:45,750 710 00:34:45,750 --> 00:34:48,000 711 00:34:48,000 --> 00:34:49,889 712 00:34:49,889 --> 00:34:52,110 713 00:34:52,110 --> 00:34:54,210 714 00:34:54,210 --> 00:34:57,750 715 00:34:57,750 --> 00:35:00,540 716 00:35:00,540 --> 00:35:04,170 717 00:35:04,170 --> 00:35:07,050 718 00:35:07,050 --> 00:35:08,460 719 00:35:08,460 --> 00:35:15,360 720 00:35:15,360 --> 00:35:17,100 721 00:35:17,100 --> 00:35:19,500 722 00:35:19,500 --> 00:35:22,830 723 00:35:22,830 --> 00:35:25,140 724 00:35:25,140 --> 00:35:26,670 725 00:35:26,670 --> 00:35:33,540 726 00:35:33,540 --> 00:35:35,580 727 00:35:35,580 --> 00:35:37,890 728 00:35:37,890 --> 00:35:42,560 729 00:35:42,560 --> 00:35:45,660 730 00:35:45,660 --> 00:35:48,600 731 00:35:48,600 --> 00:35:50,940 732 00:35:50,940 --> 00:35:52,920 733 00:35:52,920 --> 00:35:55,440 734 00:35:55,440 --> 00:35:58,800 735 00:35:58,800 --> 00:36:03,480 736 00:36:03,480 --> 00:36:05,490 737 
00:36:05,490 --> 00:36:08,220 738 00:36:08,220 --> 00:36:09,780 739 00:36:09,780 --> 00:36:12,360 740 00:36:12,360 --> 00:36:14,500 741 00:36:14,500 --> 00:36:17,560 742 00:36:17,560 --> 00:36:19,840 743 00:36:19,840 --> 00:36:23,890 744 00:36:23,890 --> 00:36:25,390 745 00:36:25,390 --> 00:36:26,620 746 00:36:26,620 --> 00:36:30,730 747 00:36:30,730 --> 00:36:33,040 748 00:36:33,040 --> 00:36:35,050 749 00:36:35,050 --> 00:36:36,940 750 00:36:36,940 --> 00:36:39,310 751 00:36:39,310 --> 00:36:44,470 752 00:36:44,470 --> 00:36:47,440 753 00:36:47,440 --> 00:36:51,030 754 00:36:51,030 --> 00:36:56,710 755 00:36:56,710 --> 00:37:02,050 756 00:37:02,050 --> 00:37:03,580 757 00:37:03,580 --> 00:37:12,070 758 00:37:12,070 --> 00:37:13,750 759 00:37:13,750 --> 00:37:16,060 760 00:37:16,060 --> 00:37:24,010 761 00:37:24,010 --> 00:37:26,380 762 00:37:26,380 --> 00:37:29,200 763 00:37:29,200 --> 00:37:31,030 764 00:37:31,030 --> 00:37:41,140 765 00:37:41,140 --> 00:37:44,079 766 00:37:44,079 --> 00:37:53,799 767 00:37:53,799 --> 00:37:57,279 768 00:37:57,279 --> 00:38:02,380 769 00:38:02,380 --> 00:38:05,559 770 00:38:05,559 --> 00:38:06,940 771 00:38:06,940 --> 00:38:09,249 772 00:38:09,249 --> 00:38:11,470 773 00:38:11,470 --> 00:38:13,569 774 00:38:13,569 --> 00:38:18,730 775 00:38:18,730 --> 00:38:22,269 776 00:38:22,269 --> 00:38:24,940 777 00:38:24,940 --> 00:38:26,109 778 00:38:26,109 --> 00:38:29,170 779 00:38:29,170 --> 00:38:30,400 780 00:38:30,400 --> 00:38:34,749 781 00:38:34,749 --> 00:38:37,960 782 00:38:37,960 --> 00:38:41,259 783 00:38:41,259 --> 00:38:43,809 784 00:38:43,809 --> 00:38:50,609 785 00:38:50,609 --> 00:38:59,109 786 00:38:59,109 --> 00:39:20,190 787 00:39:20,190 --> 00:39:22,440 788 00:39:22,440 --> 00:39:24,299 789 00:39:24,299 --> 00:39:26,339 790 00:39:26,339 --> 00:39:33,329 791 00:39:33,329 --> 00:39:34,890 792 00:39:34,890 --> 00:39:37,230 793 00:39:37,230 --> 00:39:49,770 794 00:39:49,770 --> 00:39:52,799 795 00:39:52,799 --> 
00:39:54,569 796 00:39:54,569 --> 00:39:56,460 797 00:39:56,460 --> 00:39:59,609 798 00:39:59,609 --> 00:40:01,380 799 00:40:01,380 --> 00:40:03,329 800 00:40:03,329 --> 00:40:05,460 801 00:40:05,460 --> 00:40:07,740 802 00:40:07,740 --> 00:40:10,380 803 00:40:10,380 --> 00:40:11,789 804 00:40:11,789 --> 00:40:15,359 805 00:40:15,359 --> 00:40:25,460 806 00:40:25,460 --> 00:40:31,069 807 00:40:31,069 --> 00:40:34,099 808 00:40:34,099 --> 00:40:36,170 809 00:40:36,170 --> 00:40:38,540 810 00:40:38,540 --> 00:40:42,140 811 00:40:42,140 --> 00:40:44,329 812 00:40:44,329 --> 00:40:47,359 813 00:40:47,359 --> 00:40:52,730 814 00:40:52,730 --> 00:41:00,260 815 00:41:00,260 --> 00:41:02,329 816 00:41:02,329 --> 00:41:05,359 817 00:41:05,359 --> 00:41:08,720 818 00:41:08,720 --> 00:41:11,240 819 00:41:11,240 --> 00:41:13,849 820 00:41:13,849 --> 00:41:15,859 821 00:41:15,859 --> 00:41:26,050 822 00:41:26,050 --> 00:41:28,609 823 00:41:28,609 --> 00:41:32,270 824 00:41:32,270 --> 00:41:40,670 825 00:41:40,670 --> 00:41:42,859 826 00:41:42,859 --> 00:41:45,950 827 00:41:45,950 --> 00:41:47,780 828 00:41:47,780 --> 00:41:51,290 829 00:41:51,290 --> 00:41:53,329 830 00:41:53,329 --> 00:42:00,990 831 00:42:00,990 --> 00:42:03,730 832 00:42:03,730 --> 00:42:06,130 833 00:42:06,130 --> 00:42:09,130 834 00:42:09,130 --> 00:42:11,799 835 00:42:11,799 --> 00:42:13,510 836 00:42:13,510 --> 00:42:16,120 837 00:42:16,120 --> 00:42:18,160 838 00:42:18,160 --> 00:42:20,260 839 00:42:20,260 --> 00:42:22,900 840 00:42:22,900 --> 00:42:27,760 841 00:42:27,760 --> 00:42:29,910 842 00:42:29,910 --> 00:42:32,440 843 00:42:32,440 --> 00:42:37,660 844 00:42:37,660 --> 00:42:42,250 845 00:42:42,250 --> 00:42:45,549 846 00:42:45,549 --> 00:42:48,099 847 00:42:48,099 --> 00:42:50,559 848 00:42:50,559 --> 00:42:52,270 849 00:42:52,270 --> 00:42:55,120 850 00:42:55,120 --> 00:43:02,559 851 00:43:02,559 --> 00:43:04,590 852 00:43:04,590 --> 00:43:04,600 853 00:43:04,600 --> 00:43:07,340 854 
00:43:07,340 --> 00:43:10,830 855 00:43:10,830 --> 00:43:13,350 856 00:43:13,350 --> 00:43:15,390 857 00:43:15,390 --> 00:43:17,609 858 00:43:17,609 --> 00:43:19,260 859 00:43:19,260 --> 00:43:21,930 860 00:43:21,930 --> 00:43:28,940 861 00:43:28,940 --> 00:43:28,950 862 00:43:28,950 --> 00:43:32,210 863 00:43:32,210 --> 00:43:36,940 864 00:43:36,940 --> 00:43:39,370 865 00:43:39,370 --> 00:43:41,200 866 00:43:41,200 --> 00:43:45,850 867 00:43:45,850 --> 00:43:48,070 868 00:43:48,070 --> 00:43:50,110 869 00:43:50,110 --> 00:44:05,940 870 00:44:05,940 --> 00:44:08,640 871 00:44:08,640 --> 00:44:11,039 872 00:44:11,039 --> 00:44:13,859 873 00:44:13,859 --> 00:44:15,630 874 00:44:15,630 --> 00:44:18,240 875 00:44:18,240 --> 00:44:22,890 876 00:44:22,890 --> 00:44:27,390 877 00:44:27,390 --> 00:44:31,799 878 00:44:31,799 --> 00:44:34,890 879 00:44:34,890 --> 00:44:36,390 880 00:44:36,390 --> 00:44:39,359 881 00:44:39,359 --> 00:44:41,069 882 00:44:41,069 --> 00:44:42,809 883 00:44:42,809 --> 00:44:45,630 884 00:44:45,630 --> 00:44:47,760 885 00:44:47,760 --> 00:44:49,940 886 00:44:49,940 --> 00:44:52,829 887 00:44:52,829 --> 00:44:54,359 888 00:44:54,359 --> 00:44:56,400 889 00:44:56,400 --> 00:44:58,890 890 00:44:58,890 --> 00:45:01,440 891 00:45:01,440 --> 00:45:03,809 892 00:45:03,809 --> 00:45:05,760 893 00:45:05,760 --> 00:45:07,500 894 00:45:07,500 --> 00:45:09,089 895 00:45:09,089 --> 00:45:12,900 896 00:45:12,900 --> 00:45:14,819 897 00:45:14,819 --> 00:45:16,859 898 00:45:16,859 --> 00:45:19,220 899 00:45:19,220 --> 00:45:22,559 900 00:45:22,559 --> 00:45:24,539 901 00:45:24,539 --> 00:45:26,520 902 00:45:26,520 --> 00:45:29,460 903 00:45:29,460 --> 00:45:31,470 904 00:45:31,470 --> 00:45:35,970 905 00:45:35,970 --> 00:45:38,130 906 00:45:38,130 --> 00:45:39,750 907 00:45:39,750 --> 00:45:43,620 908 00:45:43,620 --> 00:45:50,770 909 00:45:50,770 --> 00:45:57,500 910 00:45:57,500 --> 00:45:57,510 911 00:45:57,510 --> 00:46:00,430 912 00:46:00,430 --> 
00:46:02,410 913 00:46:02,410 --> 00:46:04,720 914 00:46:04,720 --> 00:46:06,700 915 00:46:06,700 --> 00:46:08,890 916 00:46:08,890 --> 00:46:10,480 917 00:46:10,480 --> 00:46:12,160 918 00:46:12,160 --> 00:46:13,870 919 00:46:13,870 --> 00:46:18,609 920 00:46:18,609 --> 00:46:21,010 921 00:46:21,010 --> 00:46:23,109 922 00:46:23,109 --> 00:46:26,220 923 00:46:26,220 --> 00:46:28,359 924 00:46:28,359 --> 00:46:30,579 925 00:46:30,579 --> 00:46:37,930 926 00:46:37,930 --> 00:46:40,960 927 00:46:40,960 --> 00:46:42,730 928 00:46:42,730 --> 00:46:48,160 929 00:46:48,160 --> 00:46:50,920 930 00:46:50,920 --> 00:46:53,079 931 00:46:53,079 --> 00:46:55,720 932 00:46:55,720 --> 00:46:59,440 933 00:46:59,440 --> 00:47:04,210 934 00:47:04,210 --> 00:47:05,560 935 00:47:05,560 --> 00:47:08,050 936 00:47:08,050 --> 00:47:10,000 937 00:47:10,000 --> 00:47:12,640 938 00:47:12,640 --> 00:47:14,109 939 00:47:14,109 --> 00:47:16,560 940 00:47:16,560 --> 00:47:19,089 941 00:47:19,089 --> 00:47:21,430 942 00:47:21,430 --> 00:47:23,230 943 00:47:23,230 --> 00:47:26,130 944 00:47:26,130 --> 00:47:28,810 945 00:47:28,810 --> 00:47:30,940 946 00:47:30,940 --> 00:47:32,710 947 00:47:32,710 --> 00:47:36,220 948 00:47:36,220 --> 00:47:38,829 949 00:47:38,829 --> 00:47:41,470 950 00:47:41,470 --> 00:47:45,099 951 00:47:45,099 --> 00:47:47,349 952 00:47:47,349 --> 00:47:48,880 953 00:47:48,880 --> 00:47:51,040 954 00:47:51,040 --> 00:47:52,450 955 00:47:52,450 --> 00:47:56,020 956 00:47:56,020 --> 00:47:58,150 957 00:47:58,150 --> 00:48:01,540 958 00:48:01,540 --> 00:48:02,800 959 00:48:02,800 --> 00:48:05,140 960 00:48:05,140 --> 00:48:07,329 961 00:48:07,329 --> 00:48:09,339 962 00:48:09,339 --> 00:48:11,170 963 00:48:11,170 --> 00:48:14,910 964 00:48:14,910 --> 00:48:17,499 965 00:48:17,499 --> 00:48:21,819 966 00:48:21,819 --> 00:48:24,009 967 00:48:24,009 --> 00:48:26,739 968 00:48:26,739 --> 00:48:28,420 969 00:48:28,420 --> 00:48:32,499 970 00:48:32,499 --> 00:48:38,940 971 
00:48:38,940 --> 00:48:41,529 972 00:48:41,529 --> 00:48:43,299 973 00:48:43,299 --> 00:48:45,729 974 00:48:45,729 --> 00:48:48,700 975 00:48:48,700 --> 00:48:50,979 976 00:48:50,979 --> 00:48:54,220 977 00:48:54,220 --> 00:48:56,289 978 00:48:56,289 --> 00:49:01,359 979 00:49:01,359 --> 00:49:03,519 980 00:49:03,519 --> 00:49:05,440 981 00:49:05,440 --> 00:49:07,450 982 00:49:07,450 --> 00:49:09,970 983 00:49:09,970 --> 00:49:11,979 984 00:49:11,979 --> 00:49:14,319 985 00:49:14,319 --> 00:49:16,359 986 00:49:16,359 --> 00:49:20,109 987 00:49:20,109 --> 00:49:23,799 988 00:49:23,799 --> 00:49:25,870 989 00:49:25,870 --> 00:49:28,599 990 00:49:28,599 --> 00:49:31,150 991 00:49:31,150 --> 00:49:34,029 992 00:49:34,029 --> 00:49:41,620 993 00:49:41,620 --> 00:49:44,340 994 00:49:44,340 --> 00:49:48,130 995 00:49:48,130 --> 00:49:51,700 996 00:49:51,700 --> 00:49:55,720 997 00:49:55,720 --> 00:49:58,480 998 00:49:58,480 --> 00:50:01,270 999 00:50:01,270 --> 00:50:07,960 1000 00:50:07,960 --> 00:50:10,300 1001 00:50:10,300 --> 00:50:12,940 1002 00:50:12,940 --> 00:50:15,130 1003 00:50:15,130 --> 00:50:19,750 1004 00:50:19,750 --> 00:50:21,460 1005 00:50:21,460 --> 00:50:25,630 1006 00:50:25,630 --> 00:50:46,470 1007 00:50:46,470 --> 00:50:52,140 1008 00:50:52,140 --> 00:50:56,790 1009 00:50:56,790 --> 00:51:03,510 1010 00:51:03,510 --> 00:51:06,690 1011 00:51:06,690 --> 00:51:08,220 1012 00:51:08,220 --> 00:51:12,300 1013 00:51:12,300 --> 00:51:19,840 1014 00:51:19,840 --> 00:51:22,750 1015 00:51:22,750 --> 00:51:31,990 1016 00:51:31,990 --> 00:51:36,310 1017 00:51:36,310 --> 00:51:40,210 1018 00:51:40,210 --> 00:51:44,950 1019 00:51:44,950 --> 00:51:49,720 1020 00:51:49,720 --> 00:51:52,870 1021 00:51:52,870 --> 00:51:54,609 1022 00:51:54,609 --> 00:52:01,510 1023 00:52:01,510 --> 00:52:03,400 1024 00:52:03,400 --> 00:52:06,510 1025 00:52:06,510 --> 00:52:11,910 1026 00:52:11,910 --> 00:52:26,950 1027 00:52:26,950 --> 00:52:28,690 1028 00:52:28,690 --> 00:52:31,690 
1029 00:52:31,690 --> 00:52:34,089 1030 00:52:34,089 --> 00:52:38,530 1031 00:52:38,530 --> 00:52:41,220 1032 00:52:41,220 --> 00:52:43,690 1033 00:52:43,690 --> 00:52:46,810 1034 00:52:46,810 --> 00:52:49,089 1035 00:52:49,089 --> 00:52:51,970 1036 00:52:51,970 --> 00:52:54,070 1037 00:52:54,070 --> 00:52:58,150 1038 00:52:58,150 --> 00:53:00,880 1039 00:53:00,880 --> 00:53:04,900 1040 00:53:04,900 --> 00:53:08,109 1041 00:53:08,109 --> 00:53:10,089 1042 00:53:10,089 --> 00:53:12,760 1043 00:53:12,760 --> 00:53:14,740 1044 00:53:14,740 --> 00:53:16,930 1045 00:53:16,930 --> 00:53:18,940 1046 00:53:18,940 --> 00:53:21,070 1047 00:53:21,070 --> 00:53:23,109 1048 00:53:23,109 --> 00:53:25,630 1049 00:53:25,630 --> 00:53:27,490 1050 00:53:27,490 --> 00:53:31,150 1051 00:53:31,150 --> 00:53:34,390 1052 00:53:34,390 --> 00:53:36,310 1053 00:53:36,310 --> 00:53:37,570 1054 00:53:37,570 --> 00:53:39,609 1055 00:53:39,609 --> 00:53:41,470 1056 00:53:41,470 --> 00:53:42,760 1057 00:53:42,760 --> 00:53:45,720 1058 00:53:45,720 --> 00:53:48,610 1059 00:53:48,610 --> 00:53:51,250 1060 00:53:51,250 --> 00:53:53,860 1061 00:53:53,860 --> 00:53:56,080 1062 00:53:56,080 --> 00:54:00,820 1063 00:54:00,820 --> 00:54:07,360 1064 00:54:07,360 --> 00:54:09,550 1065 00:54:09,550 --> 00:54:11,200 1066 00:54:11,200 --> 00:54:13,120 1067 00:54:13,120 --> 00:54:15,520 1068 00:54:15,520 --> 00:54:17,140 1069 00:54:17,140 --> 00:54:19,690 1070 00:54:19,690 --> 00:54:22,330 1071 00:54:22,330 --> 00:54:24,730 1072 00:54:24,730 --> 00:54:29,790 1073 00:54:29,790 --> 00:54:32,590 1074 00:54:32,590 --> 00:54:34,570 1075 00:54:34,570 --> 00:54:37,050 1076 00:54:37,050 --> 00:54:41,260 1077 00:54:41,260 --> 00:54:44,980 1078 00:54:44,980 --> 00:54:46,630 1079 00:54:46,630 --> 00:54:48,760 1080 00:54:48,760 --> 00:54:52,300 1081 00:54:52,300 --> 00:54:54,130 1082 00:54:54,130 --> 00:54:56,260 1083 00:54:56,260 --> 00:54:58,330 1084 00:54:58,330 --> 00:55:00,730 1085 00:55:00,730 --> 00:55:05,890 1086 
00:55:05,890 --> 00:55:08,170 1087 00:55:08,170 --> 00:55:10,770 1088 00:55:10,770 --> 00:55:13,450 1089 00:55:13,450 --> 00:55:15,700 1090 00:55:15,700 --> 00:55:19,660 1091 00:55:19,660 --> 00:55:21,670 1092 00:55:21,670 --> 00:55:24,730 1093 00:55:24,730 --> 00:55:27,040 1094 00:55:27,040 --> 00:55:28,990 1095 00:55:28,990 --> 00:55:31,570 1096 00:55:31,570 --> 00:55:34,120 1097 00:55:34,120 --> 00:55:38,400 1098 00:55:38,400 --> 00:55:46,109 1099 00:55:46,109 --> 00:55:52,120 1100 00:55:52,120 --> 00:55:53,920 1101 00:55:53,920 --> 00:55:57,460 1102 00:55:57,460 --> 00:56:00,910 1103 00:56:00,910 --> 00:56:03,730 1104 00:56:03,730 --> 00:56:05,819 1105 00:56:05,819 --> 00:56:08,349 1106 00:56:08,349 --> 00:56:10,270 1107 00:56:10,270 --> 00:56:12,549 1108 00:56:12,549 --> 00:56:14,920 1109 00:56:14,920 --> 00:56:16,839 1110 00:56:16,839 --> 00:56:18,970 1111 00:56:18,970 --> 00:56:23,109 1112 00:56:23,109 --> 00:56:25,539 1113 00:56:25,539 --> 00:56:27,280 1114 00:56:27,280 --> 00:56:29,140 1115 00:56:29,140 --> 00:56:32,020 1116 00:56:32,020 --> 00:56:33,720 1117 00:56:33,720 --> 00:56:36,039 1118 00:56:36,039 --> 00:56:37,809 1119 00:56:37,809 --> 00:56:39,490 1120 00:56:39,490 --> 00:56:41,260 1121 00:56:41,260 --> 00:56:46,299 1122 00:56:46,299 --> 00:56:50,680 1123 00:56:50,680 --> 00:56:55,900 1124 00:56:55,900 --> 00:56:58,450 1125 00:56:58,450 --> 00:56:59,950 1126 00:56:59,950 --> 00:57:01,780 1127 00:57:01,780 --> 00:57:04,390 1128 00:57:04,390 --> 00:57:06,670 1129 00:57:06,670 --> 00:57:08,109 1130 00:57:08,109 --> 00:57:10,599 1131 00:57:10,599 --> 00:57:12,339 1132 00:57:12,339 --> 00:57:13,720 1133 00:57:13,720 --> 00:57:24,150 1134 00:57:24,150 --> 00:57:27,600 1135 00:57:27,600 --> 00:57:27,610 1136 00:57:27,610 --> 00:57:28,200 1137 00:57:28,200 --> 00:57:31,220 1138 00:57:31,220 --> 00:57:31,230 1139 00:57:31,230 --> 00:57:38,240 1140 00:57:38,240 --> 00:57:38,250 1141 00:57:38,250 --> 00:57:42,020 1142 00:57:42,020 --> 00:57:44,450 1143 
00:57:44,450 --> 00:57:48,520 1144 00:57:48,520 --> 00:57:51,230 1145 00:57:51,230 --> 00:57:53,870 1146 00:57:53,870 --> 00:57:57,560 1147 00:57:57,560 --> 00:58:01,520 1148 00:58:01,520 --> 00:58:03,380 1149 00:58:03,380 --> 00:58:06,680 1150 00:58:06,680 --> 00:58:09,920 1151 00:58:09,920 --> 00:58:11,300 1152 00:58:11,300 --> 00:58:14,720 1153 00:58:14,720 --> 00:58:19,520 1154 00:58:19,520 --> 00:58:20,870 1155 00:58:20,870 --> 00:58:23,060 1156 00:58:23,060 --> 00:58:24,470 1157 00:58:24,470 --> 00:58:27,470 1158 00:58:27,470 --> 00:58:28,700 1159 00:58:28,700 --> 00:58:30,650 1160 00:58:30,650 --> 00:58:32,450 1161 00:58:32,450 --> 00:58:33,890 1162 00:58:33,890 --> 00:58:36,350 1163 00:58:36,350 --> 00:58:38,900 1164 00:58:38,900 --> 00:58:41,750 1165 00:58:41,750 --> 00:58:43,700 1166 00:58:43,700 --> 00:58:45,770 1167 00:58:45,770 --> 00:58:47,210 1168 00:58:47,210 --> 00:58:51,680 1169 00:58:51,680 --> 00:58:56,450 1170 00:58:56,450 --> 00:58:58,460 1171 00:58:58,460 --> 00:59:00,890 1172 00:59:00,890 --> 00:59:04,250 1173 00:59:04,250 --> 00:59:06,800 1174 00:59:06,800 --> 00:59:09,020 1175 00:59:09,020 --> 00:59:11,060 1176 00:59:11,060 --> 00:59:13,280 1177 00:59:13,280 --> 00:59:14,870 1178 00:59:14,870 --> 00:59:17,930 1179 00:59:17,930 --> 00:59:20,660 1180 00:59:20,660 --> 00:59:23,720 1181 00:59:23,720 --> 00:59:25,790 1182 00:59:25,790 --> 00:59:28,490 1183 00:59:28,490 --> 00:59:30,020 1184 00:59:30,020 --> 00:59:32,660 1185 00:59:32,660 --> 00:59:34,520 1186 00:59:34,520 --> 00:59:37,130 1187 00:59:37,130 --> 00:59:39,710 1188 00:59:39,710 --> 00:59:41,270 1189 00:59:41,270 --> 00:59:43,310 1190 00:59:43,310 --> 00:59:45,500 1191 00:59:45,500 --> 00:59:46,820 1192 00:59:46,820 --> 00:59:49,370 1193 00:59:49,370 --> 00:59:49,380 1194 00:59:49,380 --> 00:59:52,240 1195 00:59:52,240 --> 00:59:59,030 1196 00:59:59,030 --> 01:00:01,520 1197 01:00:01,520 --> 01:00:04,700 1198 01:00:04,700 --> 01:00:09,530 1199 01:00:09,530 --> 01:00:11,930 1200 
01:00:11,930 --> 01:00:15,349 1201 01:00:15,349 --> 01:00:17,390 1202 01:00:17,390 --> 01:00:19,640 1203 01:00:19,640 --> 01:00:22,970 1204 01:00:22,970 --> 01:00:25,370 1205 01:00:25,370 --> 01:00:27,770 1206 01:00:27,770 --> 01:00:29,420 1207 01:00:29,420 --> 01:00:31,490 1208 01:00:31,490 --> 01:00:34,849 1209 01:00:34,849 --> 01:00:37,250 1210 01:00:37,250 --> 01:00:39,500 1211 01:00:39,500 --> 01:00:41,390 1212 01:00:41,390 --> 01:00:43,220 1213 01:00:43,220 --> 01:00:44,900 1214 01:00:44,900 --> 01:00:48,740 1215 01:00:48,740 --> 01:00:50,390 1216 01:00:50,390 --> 01:00:52,730 1217 01:00:52,730 --> 01:00:53,890 1218 01:00:53,890 --> 01:00:56,839 1219 01:00:56,839 --> 01:00:59,180 1220 01:00:59,180 --> 01:01:08,480 1221 01:01:08,480 --> 01:01:13,640 1222 01:01:13,640 --> 01:01:15,559 1223 01:01:15,559 --> 01:01:17,359 1224 01:01:17,359 --> 01:01:19,160 1225 01:01:19,160 --> 01:01:24,680 1226 01:01:24,680 --> 01:01:25,670 1227 01:01:25,670 --> 01:01:29,240 1228 01:01:29,240 --> 01:01:36,520 1229 01:01:36,520 --> 01:01:36,530 1230 01:01:36,530 --> 01:01:38,799 1231 01:01:38,799 --> 01:01:41,559 1232 01:01:41,559 --> 01:01:43,959 1233 01:01:43,959 --> 01:01:45,969 1234 01:01:45,969 --> 01:01:48,489 1235 01:01:48,489 --> 01:01:51,069 1236 01:01:51,069 --> 01:01:52,539 1237 01:01:52,539 --> 01:01:54,759 1238 01:01:54,759 --> 01:01:57,519 1239 01:01:57,519 --> 01:01:59,890 1240 01:01:59,890 --> 01:02:01,929 1241 01:02:01,929 --> 01:02:03,999 1242 01:02:03,999 --> 01:02:06,039 1243 01:02:06,039 --> 01:02:11,429 1244 01:02:11,429 --> 01:02:14,679 1245 01:02:14,679 --> 01:02:17,620 1246 01:02:17,620 --> 01:02:19,359 1247 01:02:19,359 --> 01:02:22,179 1248 01:02:22,179 --> 01:02:26,019 1249 01:02:26,019 --> 01:02:30,699 1250 01:02:30,699 --> 01:02:32,439 1251 01:02:32,439 --> 01:02:35,829 1252 01:02:35,829 --> 01:02:39,399 1253 01:02:39,399 --> 01:02:41,640 1254 01:02:41,640 --> 01:02:44,049 1255 01:02:44,049 --> 01:02:45,759 1256 01:02:45,759 --> 01:02:49,239 1257 