Transcript

00:12 Okay, all right. Do you have any questions you want to ask before I get started today?

00:25 I guess some of you had indicated a problem registering for the course last week. As of this morning there are openings, so if anybody hasn't registered and would like to, you should be able to go into the registration system and do that.

00:39 All right. Now we're going to start talking about data structures for various applications, and the first set of data structures I'm going to talk about is motivated by an application to external sorting. In today's lecture I'm going to briefly describe the external sorting problem, describe an adaptation of quicksort to the external sorting environment, and see how this adaptation motivates the need for a new data structure. In the next lecture we'll take a look at an adaptation of merge sort to the external sorting environment and see that that adaptation calls for yet other data structures, and then we'll have several lectures in which we will study data structures that were motivated by external sorting.

01:36 All right, so external sorting: what is it? We want to sort a collection of records. We've got n records, these records are sitting on a disk, and when we're done sorting, the sorted collection is also to be sitting on a disk. So the input is on a disk and the output is on a disk. The constraint is that the amount of space needed by these n records is very large, sufficiently large that it's not possible for you to read in all n records, sort them using an internal sorting scheme like quicksort or merge sort, and then write the sorted result back. So that's not a feasible way to sort our collection of records; we don't have enough memory to hold all of the records at one time.

02:27 Now, the space needed may be large for two somewhat different reasons. The first is that the space needed is larger than the size of your memory because you have too many records; individual records may be small or large, but the important characteristic in this case is that the number of records is very large. Another reason why you may not be able to read all the records into memory is that you actually have a small number of records, but each record is individually large. The way you would handle these two cases could be very different, but in either case it's not practical to read in the entire collection of records, sort it using an internal sorting scheme, and then output the result.

03:31 Let's look at the second case, where the number of records is actually small, which means individual records are large. One way to handle this case is to transform the problem from a sorting problem into a permutation problem, where you start by just reading in the keys of this small number of records. The number of keys is small, so they'll fit in memory; you read them in, sort the keys, determine the required permutation that has to be performed, and then you perform that permutation on the records. Once you know what permutation is needed, you can apply it record by record: if you've got only ten records and they're big, you read in the first record and write it to the right spot in the output configuration, read in the second record and write it to its right spot, and so on; do it ten times and you've got your sorted sequence. Or maybe you read in the front parts of your ten records, internally permute them, write out the front parts, then read in the next parts, and so on. So you can transform this sorting problem into a permutation problem, which is easier to handle when the number of records is small.
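To make the key-sorting step concrete, here is a minimal C++ sketch of the permutation approach; the readKey and copyRecord disk routines are hypothetical stand-ins for whatever block I/O interface you actually have.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical disk routines; stand-ins for a real block-I/O layer.
long readKey(int recordIndex);                // fetch just the key of record i
void copyRecord(int fromIndex, int toIndex);  // read record i from input, write it to output slot j

// Sort a small number of large on-disk records by first sorting their keys.
void permutationSort(int n) {
    std::vector<long> key(n);
    for (int i = 0; i < n; ++i) key[i] = readKey(i);  // the keys fit in memory

    // order[j] = index of the record that belongs in output position j
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return key[a] < key[b]; });

    // Each large record is read and written exactly once.
    for (int j = 0; j < n; ++j) copyRecord(order[j], j);
}
```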
04:58 All right, this is not the case that we want to focus on. The case we want to focus on is where the number of records is actually very large, so it's not practical for you to read in all of the keys and determine the permutation.

05:17 Now, when we study this case of external sorting, a large number of records, we'll see the need for a data structure called a tournament tree, and we'll study those. We'll see the need for Huffman trees, and for data structures for double-ended priority queues. You've already seen a structure for a single-ended priority queue: a single-ended priority queue is a collection of elements, each element has a priority, and for this collection you can do things like insert, and remove the element with the minimum priority, say, or the element with the maximum priority; for that you use the classic heap data structure. A double-ended priority queue would require you to be able to delete sometimes the min element and sometimes the max element, so you can delete from both ends of the queue. We'll also study a technique called buffering. And the ideas that get introduced in external sorting can then be adapted back to internal sorting, to improve the performance of well-known internal sorting methods such as heap sort and merge sort. So even though we are focusing on external sorting, the ideas that get developed there can be turned around to improve the performance of internal sorting schemes.

06:53 Now, in order to study external sorting we need a model for the computer we are working with. Actually, even to study internal sorting you need a model; we'll take a look at the internal model later, but let's first look at the external sorting model. The model we're going to use is a fairly simple one. You have an arithmetic logic unit; this does all of the arithmetic, the comparisons, and so on. You have a main memory, so this unit operates off of the main memory; it can exchange data with memory and so on. And then you have a disk that's attached to the main memory. The input begins on the disk, and the output is to be left on the disk.

07:43 Now, the characteristics of the disk: on the physical medium itself, the data storage is organized in concentric circles called tracks, and you have a read/write head. If you want to read or write data, you have to know where on the disk you're going to write it or read it from; you have to move the read/write head to the appropriate track, then wait for the right part of the track to come under the head, and then you can start the read or the write.

08:18 So when you're trying to read or write, the amount of time you take is broken up into three components. There's a seek component: the time it takes to position the read/write head over the proper track. That time is huge compared to the time it takes your arithmetic unit to do operations; it may be on the order of five orders of magnitude more than the time to do an arithmetic operation. It's a very slow operation; you've got to physically move that head. Then you have latency time: the time it takes for the right part of the track, the proper part of the track, to come under the head. The disk is rotating at high speed, but you have to wait for the proper part of the track to come under the head before you can initiate the read or write. The latency is not as bad as the seek, but it's still pretty bad; you're looking at maybe four orders of magnitude more than an arithmetic operation. Then you have the transfer time, and that of course depends on how much data you're transferring.

09:42 Typically when you do the transfer, you transfer what's called a block of data. How big a block is may be something that's controlled by the operating system, so you may not have any control over it; when you go out to get a block, a block could be a thousand bytes of data, or it could be more. Or you may have the ability to choose a block size, in which case the disk has to be formatted to that block size. So depending on the environment you're in, you may or may not have control over the size of a block, but you read in blocks of data and you write out blocks of data.

10:30 All right, so three components: the seek time, the latency time, and the transfer time. Seek and latency are large, so when you get data from a disk, you try to minimize the number of times you go to the disk to get the data.
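As a rough back-of-the-envelope illustration (these numbers are invented for the example, not taken from the lecture): suppose a seek costs about 10 ms, rotational latency about 5 ms, and the transfer rate is 100 MB/s. Then reading one 4 KB block costs roughly

t ≈ t_seek + t_latency + (block size)/(transfer rate) = 10 ms + 5 ms + 4 KB / (100 MB/s) ≈ 15.04 ms,

so the cost is almost entirely seek and latency, which is exactly why you want to move data in as few, as large, chunks as possible.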
10:53 Now, typically when we develop algorithms for internal applications, which means all of your data is sitting in memory to begin with and the results are to be left in memory, the model we use is this one. You have an arithmetic logic unit, where all of the arithmetic and logical operations are done, and you have a main memory. If you want to do an add, you get the two numbers you want to add from memory, bring them in, add, and put the result back. In this model, the number of times you fetch data from memory or write back to memory is very closely related to the number of arithmetic operations you perform, and so whether you optimize the number of operations or the number of memory accesses, you're pretty much doing the same thing; you focus on one or the other. Typically we focus on the number of operations, so in all of the analyses you might have done so far, you tend to count how many adds you're doing, how many multiplies. When you say that an algorithm runs in big O of n², you're saying there are n² operations, and that's a measure of the time.
12:10 Okay, all right. So this is a simplistic model, which was good maybe up to twenty years ago or longer, and it's a model that is easy to handle in terms of doing the analysis. However, the model isn't very accurate in the context of contemporary computers, and it is unable to explain experimentally observed runtimes.

12:37 For example, look at this code here to multiply two matrices A and B to produce a result C: you have your typical three nested for-loops producing the result C. Now, if you permute these three loops, you get six different permutations, and it makes no difference to the result. You can change the order of those three loops to any of the six permutations and there's no change in the answer that is produced, assuming there's no numerical instability; you get the same answer whichever way you do it. And if you do your traditional analysis, there's no difference in the operation count either: the number of multiplies, adds, and everything else is exactly the same. So using the traditional internal computing model, one would be tempted to say that the runtime of this piece of code isn't going to change as you change the ordering of those three for-loops.

13:46 However, if you actually conducted the experiment, you would find that the runtime is different. And not only is it different, but the traditional order you'll find for matrix multiplication in a textbook, the ijk order, is actually very bad. If you try it out on computers like a Sun workstation and invert the two inner loops, you can see a speed-up which could be significant; in some experiments done some time ago the ratio was 2, but today the ratio could be larger than 2. So the question is: why does the observed runtime vary, in this case by a factor of as much as 2, but potentially by larger factors, even a factor of 8, when there is no difference in the number of operations? And that has to do with the model: the model is inaccurate.
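The slide's code isn't reproduced in the transcript, so here is a minimal C++ sketch of the two loop orders being compared, for n-by-n matrices stored as arrays of arrays (the names a, b, c, n are just for illustration; c is assumed zero-initialized):

```cpp
// Classic textbook order: i, then j, then k.
void multiplyIJK(double** a, double** b, double** c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];   // innermost line is hit n^3 times
}

// Same operations, same result: only the two inner loops are swapped.
void multiplyIKJ(double** a, double** b, double** c, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both versions perform exactly n³ multiplies and n³ adds and produce identical results; the traditional model therefore predicts identical runtimes, which is what the experiments contradict.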
14:56 So a more accurate model looks like this. It says that the PC you buy, for example, actually has many levels of internal memory. You've got your arithmetic unit, and sitting very close to it is a bank of registers; the number of registers is typically very small, somewhere between 8 and 32. If you want to add two numbers that are sitting in registers, you can add them in one cycle. But if the numbers you want to add are not sitting in registers, then you have to get them from whatever level of memory they're sitting in. They may be sitting in L1 cache; that's still very fast memory, but not as fast as register memory. Typically you may have about 32 kilobytes of L1 cache, so there's not a whole lot of data you can put in there. If what you're adding is sitting in L1 cache, that data has to be brought from L1 cache to the registers before you can perform the arithmetic. Say it takes two cycles to move a piece of data from L1 to a register; there are two pieces of data, so that could take you four cycles, and then you do the add, so that's five. So if the data you're working on isn't sitting in the registers but is sitting in L1 cache, your computation is suddenly running five times as slow.

16:28 But L1 is pretty small, so like it or not, you're sometimes going to have to go further out to get data. You've got an L2 cache, which may be on the order of a quarter megabyte, in some cases maybe even half a megabyte. If your data isn't sitting in L1 and you want to add two numbers, you've got to bring them from L2 cache to L1 cache to the registers. Let's say it takes ten cycles to move a piece of data from L2 into L1 and the registers; now you've got two pieces of data, that's 20 cycles, plus the add, 21. Now you're running 21 times as slow as you could have.

17:15 Unfortunately, when you start your program, everything starts in main memory; all your data is sitting there, not in L2 or L1. That's the big memory, so that may come as 128 megabytes, 256 megabytes, a gigabyte, several gigabytes. All the good data is sitting in main memory, but that fellow is very slow to get data from: say a hundred cycles. If you have to get both of your pieces of data for the add from there, that's 200 cycles plus one; you're now running 200 times as slow.

17:57 So the point is that in contemporary computers the operations are virtually free; what costs you is the memory access, and that's not something anybody traditionally counted. And there are many different kinds of memory access: are you accessing the registers, L1, L2, or main memory? Depending on where your data is sitting, you can see a huge difference in the runtime while performing exactly the same number of operations. So the challenge in getting good performance is arranging the computation so that the number of times you have to go to each slower level is reduced. The biggest payoff, of course, is in reducing the number of accesses to main memory, because that's the slowest.

18:57 All right, so with that, let's see if we can explain the difference in the matrix multiplication performance. First we need a better model. We've got this multi-level memory model, and for analysis purposes we'll simplify it and say we've got only one level of cache, the L2 cache. So you've got the L2 cache, the main memory, and the arithmetic unit.

19:26 Then we also need to understand how arrays are organized. Suppose you have a two-dimensional array X. It's represented using something called the array-of-arrays representation, which means that you have a one-dimensional array called X, and since I've got three rows, I've got X[0], X[1], X[2]. X[0] is actually a pointer to another one-dimensional array, which is row number 0; the next one is a pointer to another one-dimensional array, which is row 1; and so on. So a two-dimensional array is really represented as a collection of one-dimensional arrays.
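A minimal C++ sketch of this array-of-arrays layout (the sizes and names are illustrative):

```cpp
// Build a rows-by-cols 2D array as an array of row pointers.
// x itself is a 1D array of pointers; each x[i] points to a 1D row.
double** makeMatrix(int rows, int cols) {
    double** x = new double*[rows];   // the spine: one pointer per row
    for (int i = 0; i < rows; ++i)
        x[i] = new double[cols]();    // each row is its own contiguous array
    return x;
}
// x[i][j] first follows the pointer x[i], then indexes into that row, so
// elements within a row are adjacent in memory but distinct rows need not be.
```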
20:18 Also, we need to understand what happens when you service a cache miss. Let's say somebody tries to access X[0][0], and you go to main memory to fetch it. What is transferred to the L2 cache isn't just the value of X[0][0]; you fill a cache line. So if a cache line is, let's say, 32 bytes wide and you're dealing with 8-byte values, then when you service the cache miss for X[0][0], its three neighbors in the cache line all get transferred into L2 cache as well. Just like when you go to the disk you don't get just the one piece of data you want but the whole block that contains it, the same thing happens here: if the data you want isn't in L2 cache, when you go to main memory you bring in the corresponding cache-line-sized chunk.

21:17 Of course, since the size of the L2 cache is very small compared to that of main memory, it's possible that later in the program, when you come to get some other part of this array, it'll end up overwriting the cache line into which you put this piece of data. So the contents of the cache keep getting changed; you end up losing data that you brought in before, because of the difference in size between main memory and your cache.

21:54 All right, so let's analyze this thing. Let's first look at the ijk order. The innermost line is going to get hit n³ times. Now look at the computation of C[0][0]: we first get the value of C[0][0] and increment it by the right-hand side, then on the next round get C[0][0] again and increment it by the new right-hand side, and so on. So even though we're going to access the C values n³ times in total, the n accesses to C[0][0] are done in succession, one after the other. The value of C[0][0] is brought into cache once and then used n times before it disappears. And when you bring in C[0][0] you also bring in its neighbors; if you can accommodate W values per cache line, then before you have to service another cache miss on C, you can go through nW iterations of the inner loop. So the C's are accessed row by row: you use this one n times, then the next one n times, then the next, and so on.

23:40 When it comes to the A's: I first use a[i][k]; then k increases by one, so within a row you go to the next element, and the next, and the next. So the A's are also being accessed by row.
24:05 But in terms of the use pattern, we access one A value and then on the very next iteration we access the next one. When you access the first one, let's say W of them come into the cache; so you had a cache miss there, but you don't get a miss on the next few. Assume W is 4: no miss, no miss, no miss, and then a miss on the fifth. Now, at some later time you come back and use this row again: you multiply a row of A by each column of B, so when you do this row times the next column, you use these values again. But by the time you're ready to reuse the row, the cache may have lost those values; somebody else may have overwritten them, because that reuse comes much later in time. So for every W accesses to A you get one cache miss, and when you come back to reuse the row, you miss again. Where C had an nW reuse factor, A only has a reuse factor of W.

25:15 Then you go to the B's: I first use b[k][j], but on the next iteration k changes, so you go down the column. The first time, when you want this fellow, you get a cache miss and bring in W elements of its row; then you want the fellow below it, another cache miss, another W elements; then another miss; and so on. Unfortunately, by the time you're ready to use the rest of that first cache line, the line may have been overwritten. So you're getting a cache miss every single time you want a B value: the reuse factor is only one. So C has a very good reuse factor, nW; A, not so good, W; B, a horrible one, 1.

26:12 All right, so the number of cache misses: there are n³ uses of C, but only every nW of them causes a cache miss, so that's n³/(nW) = n²/W cache misses for C. There are n³ accesses of A with a reuse factor of W, so that's n³/W misses. For B, again n³ accesses, but every access is a cache miss, so that's n³. Any questions about this? All right, so if I add up my cache misses, I end up with n²/W + n³/W + n³ misses for the ijk order.
27:06 Now let's do the other order, the one I said works best: ikj. Let's look at C. In here, the j index is changing very rapidly; each time you come to the innermost line, j increases by one, so we're accessing C by row. But you use an element once, then on the next iteration you use the next one, and the next; by the time you're ready to reuse this row, you may have lost it. So this is like the A of the previous order: the reuse factor is W.

27:52 Now look at the A. The i is the outermost index, and then you have the k value. So if I hit the innermost line n times, I'm still accessing the same A value, a[i][k]; once I'm done hitting it n times, the value of k changes and I get the next one, which I then use n times, and so on. That's like what was happening with C before: you're still accessing A by row, but only every nW uses cause a cache miss, so the reuse factor is nW.

28:39 For the B, that's b[k][j]: every time you come to this line, j changes, and j is B's column index, so you're moving along a row; you're now accessing the B's by row. Each time you come here you get the next element of the row, so you get a reuse factor of W; by the time you come back to reuse an old one, it could have been overwritten.

29:05 So now you have reuse factors of W, nW, and W; previously it was nW, W, and 1. We've improved the reuse factor of one of the arrays without reducing the reuse factors of the others.

29:26 All right: C had a reuse factor of W, so you get n³/W misses; A had a reuse factor of nW, so n²/W; and B had a reuse factor of W, so n³/W. If you add these up, you end up with n³/W + n²/W + n³/W. So the important thing is that in the traditional ijk scheme the reuse factors were nW, W, and 1, and now they are W, nW, and W, and that accounts for the difference in performance that people observe.

30:08 All right, so we arrived at these two equations. We can take the ratio of cache misses with ijk versus ikj; assuming n is large, it works out to about (1 + W)/2. So if you're dealing with a 32-byte cache line and 8 bytes per data item, you have 4 pieces of data coming in per line, W is 4, in which case you can expect a performance ratio of about 2.5 on a large matrix. If you have a 64-byte cache line, same 8 bytes per data item, then W becomes 8 and you can expect a ratio of 4.5. If you're dealing with integer data, four bytes, then W is 16 and you get a ratio of 8.5. So being aware that computers have caches, and reorganizing the computation to optimize the use of the cache, can improve the performance of your code quite a bit.
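Putting the two miss counts side by side and keeping only the dominant terms for large n, the quoted ratio falls out directly:

\[
\frac{\text{misses}_{ijk}}{\text{misses}_{ikj}}
  = \frac{n^2/W + n^3/W + n^3}{n^3/W + n^2/W + n^3/W}
  \;\approx\; \frac{n^3\,(1 + 1/W)}{2\,n^3/W}
  = \frac{W + 1}{2}.
\]

For W = 4 this gives 2.5, for W = 8 it gives 4.5, and for W = 16 it gives 8.5, matching the ratios above.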
31:22 And so, using some of these ideas from external sorting back in internal sorting can help us with internal sorting schemes, because basically, as we'll see, the ideas in external sorting are there to reduce the number of times you go to disk; in the case of internal sorting, you want to reduce the number of times you go to main memory. So we can get faster internal sorting by using some of the ideas we're going to talk about. Experimentally, people know that if you recode merge sort to use something called a tiled merge sort, at the high level you're doing exactly the same thing, but you reorganize the code and you can run two times as fast as the textbook merge sort. Similarly, there are schemes known for heap sort where, instead of working with a binary tree of degree two, you work with a tree of degree four, and you can get a speed-up using essentially the same heap sort idea. So even though we're going to have our discussion in the context of external sorting, you shouldn't go back thinking that these ideas apply only to the external domain; they apply also to the internal domain.

32:45 All right, so let's now turn our attention to external sorting methods. If I came to you and said I need you to write code to do an external sort, being experts in internal sorting you'd probably say: let me take a look at the best internal sorting scheme I have and see if I can adapt it to the new environment. That seems to be the logical way to solve a problem related to something you already know how to solve. Now, in the internal sorting world we know that, depending on whether you're looking for average-case or worst-case performance, the choice of method is different: if you're trying to optimize the average runtime, you'll use quicksort; if you're trying to optimize the worst-case runtime, you'll use merge sort. So depending on your interest, you might want to adapt one or the other, and we'll take a look at adapting both: today we'll see only how to adapt quicksort, and in the next lecture we'll see how to adapt merge sort.

34:00 All right, so let's take a quick overview of an internal quicksort and see how it works. Basically, I can describe it recursively. You have a collection of records to sort; I'm showing only the keys. The first thing you do is pick a pivot using some pivot-picking strategy; in this case I'm picking the leftmost fellow, six, as the pivot. Then you partition everything based on the pivot, so that the pivot ends up somewhere in the middle: everybody less than or equal to the pivot is on the left side, everybody greater than or equal to it is on the right side. Then I sort the left part recursively and the right part recursively, and basically that's all there is to quicksort. In quicksort this middle group is of size one; in our adaptation to an external sort, the middle group will be large. We'll see how that comes about.
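As a reminder of the internal scheme being adapted, here is a minimal recursive quicksort sketch in C++, using the leftmost element as the pivot as in the lecture's example (a real implementation would pick pivots more carefully):

```cpp
#include <vector>

// Sort a[lo..hi] recursively; the pivot ends up in its final position,
// forming a "middle group" of size one between the left and right parts.
void quickSort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo];                      // leftmost element as the pivot
    int i = lo, j = hi;
    while (i < j) {
        while (i < j && a[j] >= pivot) --j; // scan from the right for a small key
        a[i] = a[j];
        while (i < j && a[i] <= pivot) ++i; // scan from the left for a large key
        a[j] = a[i];
    }
    a[i] = pivot;                           // pivot placed at its final index i
    quickSort(a, lo, i - 1);                // recursively sort the left group
    quickSort(a, i + 1, hi);                // recursively sort the right group
}
```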
35:21 All right. So I take the memory that's available to me and break it up into components for different uses. My strategy is: you've got all of this data sitting on a disk, and somehow I have to break it up into three groups: a middle group, a group of small elements, that's the left group, and a group of big elements, that's the right group. Everybody in the left group is less than or equal to everybody in the middle group, and everybody in the right group is greater than or equal to everybody in the middle group. Then I recursively sort the left and right groups, assuming the middle group is already sorted. That's the general strategy, and it's pretty similar to the internal quicksort except that the middle group is now of size more than one.

36:13 In order to accomplish this, I have to keep in mind that my data is not in memory; it's sitting out on a disk. So in this simple strategy I'll take the memory I have and break part of it up into what I'll call three buffers. These are input or output buffers; the size of a buffer will be the same as the size of a block on disk. To do any work I've got to bring the data into memory, so when I read, I'm going to need an input buffer: I bring in a block of data and put it into memory, so I need something big enough to hold a block. When I want to write something to disk, I'm going to need a place in memory where I accumulate the stuff so I can write a whole block out; I don't want to be writing out one record at a time, because that's too many disk accesses, so I'll be working in units of a block.

37:13 So in this strategy I use an input buffer, one block long, for reading from the disk one block at a time. Then I collect the left part, the stuff that is smaller than the middle group, in memory one block at a time, in a buffer we'll call the small buffer; that's also one block long. Then I have a buffer called the large buffer, where I collect the right group, one block at a time.

38:00 So the idea is: whenever you need to read data, you read it into the input buffer and process it. Records identified as part of the left group are collected in the small buffer; those identified as part of the right group are collected in the large buffer. When the input buffer becomes empty, I read in another block from disk. When the small buffer becomes full, I write it to disk and can start accumulating more of the left group. When the large buffer becomes full, I write it to disk and can begin accumulating more of the right group, the big group. The remainder of the memory is used to hold the middle group.

38:51 For this scheme to work, the assumption is that the amount of memory you have is more than three blocks. That's a pretty reasonable assumption, because block sizes on disk may be on the order of a few kilobytes, whereas the amount of memory you have is on the order of hundreds of megabytes or gigabytes.

39:17 All right, so I've got this chunk of memory for the middle group. As far as the middle group goes, it has to satisfy the property that everybody you output through the small buffer is less than or equal to everybody in the middle group, and everybody you output through the large buffer is greater than or equal to everybody in the middle group.

39:45 [Question from the audience.] Well, at this point the only requirement is that the records of the middle group should be less than or equal to those in the large group and greater than or equal to those in the small group. Once we are done: as I described, the way the process works, we output these two side groups block by block, but when you have finished processing all of the records on the disk, you'll be left with a whole chunk of records in the middle group, and then you have to write those out. To write those out you'll need to sort them first, so you can write them out as a sorted middle group.

40:28 That's correct, there will be three partitions. The sizes aren't known ahead of time; they depend a lot on the order of the data, just like in quicksort. Three partitions are created: small, middle, and large.

40:59 [Question: won't other disk users delay things?] Well, of course they will delay things, and the real question is whether you can do anything about it. If you're in a multiprocessing system, can you control the other users? No, you can't; but do you have an alternative to performing the task? That's really the issue.
41:17 If you're on a single-user system, then you can get into schemes to try and optimize reading from the disk and writing to the disk; but in a multi-user system, where lots of people are sharing the disk, you can't control the position of the disk head.

42:13 [Question about balancing the partitions.] There is no efficient way known to partition into equal-sized groups. Whenever you partition, you pick a pivot: you either pick a random pivot, or take a median of three, or use more sophisticated rules like median of seven, median of nine, or whatever, but they don't guarantee equal-sized partitions. And second, setting aside how you choose the pivot, that kind of balancing strategy is based on an internal scheme, which means the whole data set is sitting in memory, and that isn't practical here.

42:52 Okay, all right. So you've got your disk out there. As I explained, to initialize the system you just read as many blocks from disk as needed to fill up your middle group, whatever memory you've got there, and then you read in one more block to fill up the input buffer. That's the initial configuration, and then you start processing. You look at the next record in the input buffer. If it's less than or equal to everybody in the middle group, you put it into the small buffer. If it's greater than or equal to everybody in the middle group, you put it into the large buffer. If it lies in between, then you can't put it in either buffer; you have to either take the smallest element of the middle group and move it to the small buffer, or take the largest element and move it to the large buffer, and then the new element can be pushed into the middle group. So there are three cases, and these three strategies cover all of them.

44:09 While you're processing, these three steps really run in a loop; this is basically what you'll be doing throughout. In the process, the input buffer may become empty, so you've got to stop to refill it; the small buffer may become full, so you've got to stop to output it; the large buffer may become full, so you've got to stop to output it; then you continue. Once you're done, whatever is sitting in the middle group can be sorted and written out using an internal sort, and then you recursively sort the small group and the large group.

44:54 So the issue that arises now is: what data structure do you use to implement the middle group? Consider the operations you need to perform on it. When you take a record from the input buffer: if it's less than or equal to the smallest element in the middle group, it goes out to the small buffer; that's easy, you just put it at the end of that buffer. If it's greater than or equal to everybody in the middle group, you put it at the end of the large buffer. Otherwise the item has to be inserted into the middle group, but to insert it you've got to take somebody out, because the middle group is full.

45:59 So you may want to take the minimum item out, or you may want to take the maximum item out. In order to get roughly equal small and large groups, you would probably use some heuristic which says: if the small group is currently bigger than the large group, take out the max element and put it in the large buffer; if the small group is currently smaller than the large group, take out the min element and put it in the small buffer, in an attempt to balance the sizes of the two groups.

46:25 So I need to be able to remove sometimes the min and sometimes the max, and once I've done the remove, I need to do an insert. So you need to support what's called a double-ended priority queue: you need to be able to insert, to remove the min, and to remove the max. To implement the middle group, I need a double-ended priority queue.
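Here is a minimal C++ sketch of that processing loop. For illustration, a std::multiset stands in for the double-ended priority queue (it supports insert, delete-min, and delete-max, though without a real DEPQ's constant factors), and the refill/flush routines are hypothetical placeholders for the block I/O just described.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical block-I/O helpers for the three one-block buffers.
bool refillInput(std::vector<int>& in);    // read next block; false at end of input
void flushSmall(std::vector<int>& small);  // write the small buffer's block to disk
void flushLarge(std::vector<int>& large);  // write the large buffer's block to disk

void partitionPass(std::multiset<int>& middle,  // middle group, pre-filled from disk
                   std::size_t blockSize) {
    std::vector<int> in, small, large;
    long smallCount = 0, largeCount = 0;        // totals, for the balancing heuristic

    while (refillInput(in)) {
        for (int key : in) {
            if (key <= *middle.begin()) {           // case 1: left (small) group
                small.push_back(key); ++smallCount;
            } else if (key >= *middle.rbegin()) {   // case 2: right (large) group
                large.push_back(key); ++largeCount;
            } else {                                // case 3: displace min or max, then insert
                if (smallCount < largeCount) {      // small group lagging: evict the min
                    small.push_back(*middle.begin());
                    middle.erase(middle.begin()); ++smallCount;
                } else {                            // otherwise: evict the max
                    large.push_back(*middle.rbegin());
                    middle.erase(std::prev(middle.end())); ++largeCount;
                }
                middle.insert(key);
            }
            if (small.size() == blockSize) { flushSmall(small); small.clear(); }
            if (large.size() == blockSize) { flushLarge(large); large.clear(); }
        }
    }
    // Flush partial blocks; the caller then sorts and writes the middle group
    // and recursively processes the small and large partitions on disk.
    if (!small.empty()) flushSmall(small);
    if (!large.empty()) flushLarge(large);
}
```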
46:49 So that's one application for double-ended priority queues. Another possible application for priority queues is a queue in a network. For example, you have a finite-size buffer and packets coming in; when a packet comes in, it gets buffered into the queue, waiting for the output line to become idle. When the output line becomes idle, you take the highest-priority packet out of the buffer and put it onto the line, so you want a delete-max. When a packet comes in, you want an insert; but when a packet arrives and the buffer is full, you probably want to drop the lowest-priority packet, so you want a delete-min. So you could implement a network buffer as a double-ended priority queue.

47:46 Now, in the sorting application, we could further improve the performance of this external quicksort. While we're doing all of what we said, if we had a second input buffer, I could simultaneously be reading a block instead of idling on my I/O channel, and then switch between the buffers. Same thing on output: I could have two small buffers, so while one is being written, you continue filling the other; I could have two large buffers, so while one is being written, you continue on the other; you don't have to stop. So you get this idea of having more buffers than is strictly essential, so as to improve performance. That issue is more interesting when we look at merge sort; we'll do that later.

48:38 All right, so I guess that's kind of a summary of some of the ideas behind external sorting, especially behind the adaptation of quicksort, and also why what we're going to be doing applies not just to external sorting but also to internal sorting, and to internal computing in general, in trying to optimize cache utilization. Any questions? Okay, we meet Wednesday.