Transcript

Okay, all right. Last time we started to talk about external sorting, and we're going to continue with that discussion of external sorting, actually for several lectures. Before I continue, do you have any questions on anything we've done so far? Yeah?

I guess the question is: last time, when I spoke about caches, there were multiple levels of caches, and I didn't take into account the time needed to move data from one cache to the registers to perform the arithmetic. Well, the numbers I had up there you can interpret either way. Say I said it's 100 cycles to get data from memory to L2. You could interpret that as: in 100 cycles you can get the data from memory to L2, and also into L1 and the registers. Alternatively, you can add the 10 cycles to go from, let's say, L2 to L1, and then whatever comes after that; it's not going to make a whole lot of difference. So in my analysis I just assumed that the hundred would cover everything: you could get the data from memory all the way down to the registers.

What about the search time required within a cache; is the quoted access time independent of the search mechanism used there? Well, again, we did an example to explain how things work. We didn't go into the circuitry, or whether you're using an LRU method to invalidate cache lines, and so on; we didn't go into that level of detail. The purpose of that discussion was to show that there are caches, that it costs time to get data, and that what it costs depends on whether you're getting it from the L1 cache, from L2, or from memory, and that the techniques used for external sorting can therefore be adapted to internal computations. We certainly did not go into a level of detail that would explain the different cache replacement strategies, the different cache access mechanisms, and so on. So it's kind of futile to ask whether those numbers include the details of this search or that particular mechanism. All right, other questions?

Now, the lectures are being recorded, as you've noticed, and they are available through streaming video from the Internet, in case you miss one of the lectures or you want to go back and review a past lecture. I've updated the link on our course website to get you to the place where the lectures are available; to get to the recorded lectures you will need to log in using your GatorLink accounts.

All right, so as I said, we're going to continue with our discussion of external sorting. And as I said last time, a good strategy for developing an external sort is to adapt what you already know from internal sorting: you take the best internal sort methods you're aware of and adapt them to the external sort environment. With that as a strategy, we took a look at the best sort method we knew for average performance, quicksort, and we saw how to adapt it. That adaptation revealed the need for a data structure to represent a double-ended priority queue. It also introduced the concept of buffers: portions of memory into which you can read a block of data from your disk, or portions of memory in which you can assemble a block of data that you'll eventually write to disk.
It also revealed the idea of trying to do things like input/output and internal processing in parallel. In the simple discussion we had, we had to stop working when an input buffer became empty, or when one of the output buffers became full. But toward the end I said that if you had double buffers, two input buffers, two output buffers for the left group and two output buffers for the right group, then while you were filling one input buffer you could be consuming from the other input buffer, and while you were writing one output buffer you could be filling the other one from internal processing. So that discussion introduced several new concepts.

Today we're going to take a look at merge sort, which is the best sort method from the standpoint of worst-case performance. We will see how we can adapt it to the external environment, and see what additional data structures may be needed to support an external sort.

All right, so let's first take a quick review of internal merge sort. Internal merge sort essentially works in two phases, or two steps. In the first phase you create sorted segments. For example, if you're using something called a natural merge sort, then the naturally occurring sorted segments in your collection of records to be sorted are used: you simply sweep your input file from left to right, and wherever there is a decrease in key, that marks the end of a segment. Another way to create the initial sorted segments is to disregard the naturally occurring sorted segments and just take blocks of some fixed size; maybe you take 20 records at a time, sort them using an internal sort, and then you get segments of size 20 that are sorted. So you start by creating sorted segments, and those are two common schemes used to create the initial sorted segments. Then, in the second phase, you merge together pairs of sorted segments. When you merge two segments together, you replace those two with one; if you repeat this process, you'll be left with a single sorted segment. This second phase works in something called a pass: if you started with ten sorted segments, in one pass you reduce the ten segments to five, in the next pass you reduce the five to three, then the three to two, and the two to one. All right, so that's basically how an internal merge sort works.
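Since this two-phase structure is the template for everything that follows, here is a small Python sketch of it; the function names and the fixed segment size of 20 are just illustrative choices, not anything from the lecture.

    # Minimal sketch of internal merge sort: phase 1 builds fixed-size sorted
    # segments, phase 2 merges pairs of segments in passes.
    def make_segments(records, size=20):
        # Phase 1: sort fixed-size chunks, e.g. 20 records at a time.
        return [sorted(records[i:i + size]) for i in range(0, len(records), size)]

    def merge_two(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def merge_sort_in_passes(records):
        segments = make_segments(records)
        while len(segments) > 1:
            # One pass: pairwise merge; an odd segment is carried over as-is,
            # which is how ten segments become five, then three, two, one.
            nxt = [merge_two(segments[k], segments[k + 1])
                   for k in range(0, len(segments) - 1, 2)]
            if len(segments) % 2:
                nxt.append(segments[-1])
            segments = nxt
        return segments[0] if segments else []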
Now we're going to try to do an external merge sort in essentially the same manner: we have two phases. In the first phase we will create sorted segments; in the context of external sorting we'll refer to a sorted segment as a run. Then, in the second phase, we'll merge together pairs of runs, in passes.

All right, so let's illustrate how this works by looking at an example. In our example we want to sort 10,000 records; however, the computer we have has sufficient memory to hold only 500 records. We'll assume that the block size is 100 records. Then, for the analysis, we'll make use of some variables to denote how much time different tasks take. T_IO will represent the amount of time it takes to either read in or write out one block of data; that includes the seek time and the latency time as well as the transmission time. The analysis could, for example, assume the worst-case scenario, where you pay the maximum seek (you're moving from one end of the disk all the way to the other) and the maximum latency (you have to wait for a full rotation to take place). We will make use of T_IS, the time to internally sort a memory load of data; in this case that's the time to sort 500 records. Again, T_IO is the time to input 100 records, one block. Then we'll use T_IM as the time to internally merge one block load of data, that is, the time to create 100 records in the output file.

All right, so there are two phases: the run generation phase (a run is just what we were calling a sorted segment before) and the run merging phase, which is what was phase two of the internal sort, where we merge together pairs of runs until we're left with a single run.

So let's look at the first phase, run generation. Here's my memory; it's got enough capacity to hold 500 records. When I say it's got enough capacity to hold 500 records, what I really mean is that it can hold the 500 records plus the program plus other variables needed to run the program; but basically I have the memory to sort 500 records using some internal sorting scheme. So 500 records represents five blocks of data. I've got 10,000 records sitting on disk, and that represents 100 blocks of data. The strategy is going to be: you read in five blocks, which fills up your memory; you sort the five blocks, using, say, an internal merge sort; and then you write out those five blocks as a sorted segment, or a run. So this strategy is going to create runs that are five blocks long. This creates one run, and then basically we're going to repeat this 20 times, and the result is going to be 20 runs, each run being 5 blocks long. Any questions about the strategy to create runs? You see that we have enough memory to do this.

All right, so let's look at the time it takes. You want to input five blocks, so that's 5 T_IO; this is the cost per iteration of this loop. So 5 T_IO to input those blocks; you want to sort, and that's one T_IS, the time to sort a memory load; then you want to output those five blocks, so that's another 5 T_IO. So creating one run takes you 10 T_IO plus T_IS, and then we're going to do this 20 times, so it becomes 200 T_IO plus 20 T_IS. So, using this strategy, we can create runs, each run five blocks long, and the amount of time we're going to spend is 200 T_IO plus 20 T_IS. Are we in agreement on this? Okay.
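As a sketch of that run-generation loop (with the disk simulated as a plain Python list of 100-record blocks, since we're not modeling a real device):

    # Run generation (phase one) for the example: 100 blocks of 100 records,
    # with memory large enough for 5 blocks at a time.
    BLOCK = 100          # records per block
    BLOCKS_PER_LOAD = 5  # memory holds 500 records

    def generate_runs(disk_blocks):
        """disk_blocks: list of 100-record lists (100 blocks = 10,000 records)."""
        runs = []
        for i in range(0, len(disk_blocks), BLOCKS_PER_LOAD):
            # read a memory load: 5 T_IO
            load = [r for blk in disk_blocks[i:i + BLOCKS_PER_LOAD] for r in blk]
            load.sort()                       # internal sort: T_IS
            # write the sorted memory load back out as one 5-block run: 5 T_IO
            runs.append([load[j:j + BLOCK] for j in range(0, len(load), BLOCK)])
        return runs  # 20 runs of 5 blocks each; total cost 200 T_IO + 20 T_IS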
Well, let's go to run merging, the second phase. Again we'll use the concept of a merge pass: in a merge pass we read in all of the runs and reduce the number of runs by a factor of two. So the strategy is to pairwise merge these 20 runs and bring them down to 10. We'll see in a moment how we actually accomplish this, but at a high level the strategy is to do this pairwise merge of the 20 runs into 10. We use the term merge pass to represent a process in which all the current runs are read in and merged, and the number of runs you are left with is one half of what you started with. Then we're going to repeat this process four more times.

I have a picture here to show what's going to happen. I start with 20 runs; in the first merge pass that 20 goes to 10 by merging together pairs of runs, so the R's become the S's. Each R is 5 blocks long, so each S will be 10 blocks long. In the second pass I merge the S's in pairs to get the T's, and the number of T's will be 5; each S is 10 blocks long, so each T is going to be 20 blocks long. In the third pass, pairs of T's are merged to get the U's, and since you have an odd number of T's, the last one, T5, will not have a partner to merge with, so we just copy it over as a U. So now the number of runs reduces from five to three; the T's were 20 blocks long, so the U's will be 40 blocks long, except the last one, which is only 20 blocks long. Then I have another round of merging, another merge pass: the U's become V's, and from 3 we go to 2; the V's are now 80 blocks long, except for the last one, which is only 20. And then there's a final merge pass, in which the V's get replaced by a single W. So in the second phase I merge together pairs of runs; the merging is done in passes; and each pass reduces the number of runs by a factor of 2, roughly, while at the same time it doubles the size of a run. All right, so that's the overall strategy, and it's the same as what's done in the second phase of an internal merge sort.

How do we merge a pair of runs, given that each run is of a size larger than the capacity of memory? That's something we're going to look at in a moment; at this point I just wanted to make clear how we plan to do this at a high level. To implement it, as you've correctly noted, we need to figure out how we implement one of these pairwise merges, at each node of this merge tree; how do we actually do that merge? That's what we want to see right now. Any other questions about the overall strategy?
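The pass-by-pass bookkeeping just described (20 R's to 10 S's to 5 T's to 3 U's to 2 V's to 1 W, with run lengths doubling) can be checked in a few lines of Python:

    # Check of the merge-pass bookkeeping: run counts roughly halve each pass
    # and run lengths roughly double; an odd run is carried over unmerged.
    runs = [5] * 20                      # 20 R's, each 5 blocks long
    p = 0
    while len(runs) > 1:
        nxt = [runs[k] + runs[k + 1] for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            nxt.append(runs[-1])         # e.g. T5 is copied over as a U
        runs, p = nxt, p + 1
        print(f"pass {p}: {len(runs)} runs, block lengths {runs}")
    # pass 3 prints 3 runs of lengths [40, 40, 20]; pass 5 prints the single W.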
All right, so let's take a look at how we might merge together a pair of runs. Here's our memory configuration. I'm going to partition memory into buffers, like I did for quicksort. I'll designate a portion of memory as an input buffer, input 0; this will be one block long, so 100 records of memory, and it's going to be used to input run 1, one block at a time. Then I'll have another portion of memory designated as an input buffer, call it input 1, and that's going to be used to read in run 2, one block at a time. I still have 300 records' worth of space; I take another 100 records' worth of it and designate that as an output buffer, in which I will assemble a block of S1. So input 0 will be used to bring in R1 one block at a time, input 1 will be used to bring in R2 one block at a time, and the internal merge will assemble S1 here, one block at a time. If an input buffer becomes empty, we replenish it with the next block from its run: from R1 for input 0, from R2 for input 1. If the output buffer becomes full, I output that block of S1 to disk, and once I'm done with the output I can continue to assemble the next block of S1.

So, as I said, input 0 services R1, input 1 services R2, and the output buffer collects S1. To start the process, I read in the first block of R1 and the first block of R2. To do the merge, I look at the first records in these two blocks, and the smaller one moves to the output buffer. Then you look at the first unmoved records from here and there, and the smaller one moves over. You keep doing this until one of the input buffers becomes empty or the output buffer becomes full. When the output buffer becomes full, you have to stop internal merging and write it out; when an input buffer becomes empty, you stop internal merging and get the next block from the disk. So, as long as you have enough memory for three blocks of data, you can merge together two runs, independent of how long the runs are.

Any questions about how we merge together a pair of runs, independent of their size, provided you've got three blocks of memory? Yeah? Let's see: what if the output buffer becomes full and you've got half a block left here and half a block there? Well, you leave those; at that time we're just going to write this one out, and once you finish writing it, we'll continue with this half and that half. There's no requirement that all three events happen at the same time, that the output buffer gets full and the two input buffers become empty simultaneously. All right, any other questions? No? All right.

Let's look at how much time this takes. Each of R1 and R2 is five blocks long. To do the merge I just described, all five blocks of R1 and all five blocks of R2 are read in, at different times, so the total input time is 10 T_IO. All ten blocks are written out, at different times, one block at a time, so there are ten blocks of writing; that's another 10 T_IO. And all ten blocks are internally merged from input buffers to the output buffer, so that's 10 T_IM. So it takes 20 T_IO and 10 T_IM to merge R1 and R2 into S1; that includes the input, the output, and the internal processing.

Now for the time for the first pass: in the first pass I'm going to do this ten times, first R1 and R2, then R3 and R4, and finally R19 and R20. I do it ten times, so I'll spend ten times that much time. All right.
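Here is a compact sketch of that three-buffer merge, with the disk again simulated as lists of blocks; refills and flushes happen exactly at the empty-buffer and full-buffer events just described.

    # Merge two runs r1, r2 (each a list of 100-record blocks) using only two
    # input buffers and one output buffer, as described above.
    BLOCK = 100

    def merge_pair(r1, r2):
        out_run, out_buf = [], []
        in0, in1 = list(r1[0]), list(r2[0])   # read the first block of each run
        n0, n1 = 1, 1                          # index of the next block to read
        while in0 or in1:
            # move the smaller front record; an exhausted run always loses
            if in0 and (not in1 or in0[0] <= in1[0]):
                out_buf.append(in0.pop(0))
            else:
                out_buf.append(in1.pop(0))
            if len(out_buf) == BLOCK:          # output buffer full: write it out
                out_run.append(out_buf); out_buf = []
            if not in0 and n0 < len(r1):       # input buffer empty: replenish
                in0 = list(r1[n0]); n0 += 1
            if not in1 and n1 < len(r2):
                in1 = list(r2[n1]); n1 += 1
        if out_buf:
            out_run.append(out_buf)
        return out_run   # for 5-block runs: 20 T_IO of I/O plus 10 T_IM of merging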
Then we're going to do the next pass, which merges the S's into T's, with the same strategy as before, except that now each S is 10 blocks long. So to read in S1 is 10 T_IO, and to read in S2 is 10 T_IO; in the merge, all 20 blocks get read in, at different times, so the input time is 20 T_IO. The output time is the same as the input time: everything you read in, you've got to write out. And everything you read in also has to be merged from the input buffers to the output buffer, so that's 20 T_IM. So the total time to do one pair, when you're going from S's to T's, is 40 T_IO's and 20 T_IM's. To go from the S's to the T's, there are 10 S's, so we'll have five pairwise merges; it becomes five times that, which is 200 T_IO plus 100 T_IM, the same as what we had when we went from R's to S's.

In general, in a merge pass you read in all of the data, and that's 100 T_IO; you write out all of the data, another 100 T_IO; and you move all the data from input buffers to output buffers, which is 100 T_IM. So, independent of which pass you're in, you spend roughly the same amount of time, 200 T_IO plus 100 T_IM. I say roughly because if you have an odd number of runs, the last run may be handled a little differently, but approximately, each pass takes the same amount of time.

So the total time you spend merging runs together is the time for a merge pass (the time to read in all the records, write out all the records, and move all the records from input to output buffers) times the number of passes that you perform. And the number of passes is the log, base 2, of the number of runs you started with. In our case we started with 20 runs, so it becomes log of 20 base 2; you take the ceiling of that, which works out to 5. So you have 5 passes, each taking the single-pass time given above.

All right, so the strategy used for internal merge sort can be adapted to an external merge sort; the only requirement is that you have enough internal memory for three blocks. Any questions about the adaptation, and about whether you can in fact use this strategy to sort arbitrarily large collections of records?

All right, so let's now take a look at the factors that contribute to our external merge sort time, and then we can focus on those factors to see if there's a way to reduce the components that make up the total time. Run generation took 200 T_IO plus 20 T_IS. The components there: there's T_IS, the internal sort time, the time to sort one memory load; and there is the input and output time, the 200 T_IO we're spending. As far as run merging goes, we have the time for a pass times the log, base 2, of the number of runs. So the internal merge time is a contributor: T_IM, the time it takes to move a block from the input buffers to the output buffer. The total time you spend inputting and outputting is 200 T_IO times log of 20 base 2. The number of initial runs, in this case 20, is a factor: if you started with a smaller number of runs, you'd have to make fewer passes. And the other factor here is that base 2 in the logarithm. We have a base 2 in the logarithm because we were merging pairwise; that's the order of the merge, a two-way merge. Conceivably we could try to merge, say, four runs together into one instead of two into one, and that would reduce the number of passes.
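Putting the two phases together for this example (the ceiling of log base 2 of 20 is the 5 passes just mentioned) gives the totals below, kept in symbolic units:

    # Total-time bookkeeping for the 10,000-record example, in units of
    # T_IO (one block in or out), T_IS (sort a memory load), T_IM (merge a block).
    from math import ceil, log2

    runs = 20
    passes = ceil(log2(runs))                    # 5 passes
    gen_io, gen_is = 200, 20                     # run generation
    pass_io, pass_im = 200, 100                  # one merge pass
    total_io = gen_io + passes * pass_io
    total_im = passes * pass_im
    print(f"{passes} passes: {total_io} T_IO + {gen_is} T_IS + {total_im} T_IM")
    # -> 5 passes: 1200 T_IO + 20 T_IS + 500 T_IM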
All right, so we've seen the factors; let's go back and look at them again and see what we might be able to do to improve, that is, to reduce, the total time spent both in run generation and in merging all the runs together. The objective, of course, is to reduce the overall time. We don't really care how much time is spent in the first phase and how much in the second phase individually; the sum of the two is what we want to reduce.

So, to improve run generation: one component there was the total time spent on input and output. If we look at the formula, we had the input time, to that we added the output time, and to that we added the time spent internally sorting. So we could think of trying to do these three activities concurrently rather than serially. It shouldn't be that while you're inputting you can't be outputting, or that while you're outputting you can't be generating part of the output for the next round of writing. So we might try to see if there's a way to do these three activities in parallel, to overlap the time spent in the three. In that case, instead of adding the time spent on input to the time spent on output to the time spent internally sorting, maybe we'd be taking the max of those three quantities. Certainly, in order to overlap, your hardware has to support doing these activities in parallel; so we'll look at the possibility of overlapping them.

Now, another component we put down there was the time to sort a memory load. It doesn't look like you're going to be able to do much there, because that's internal sorting, and we said we're already starting with the best internal sort scheme. Another component we can work on, one that didn't show up in the time needed for run generation but affects the second phase, is the number of runs. When we looked at the factors for the second phase, the number of runs you started with was a factor, and that's determined by run generation. In the strategy we've used so far, the number of runs was simply the total number of records divided by the capacity of your memory, but we may think of alternative strategies for generating runs where perhaps you end up with a smaller number of runs.

All right, so we want to look at two things. One is overlapping. As far as overlapping is concerned, I said it depends on what your hardware supports. A minimal hardware configuration you would need is two disk drives: you can't be reading and writing from the same disk at the same time, because the head cannot be at two different places. So you're going to need two different disk drives; you can be reading from one and writing to the other. Also, your hardware needs to support what we call a non-blocking read and a non-blocking write. By non-blocking I mean that while you are writing a buffer load, that write should not block all of the other operations of the computer. I should be able to initiate a command that says "write this block," and as soon as the command is initiated, go on to the next instruction, which could be to read a block from disk; and when that read is initiated, continue with internal arithmetic and logical operations. So we're going to assume that you have two disks, and also that you have non-blocking reads and writes.
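As a back-of-the-envelope illustration of what overlapping buys (the per-load costs here are made-up numbers, not the lecture's): with serial operation, each memory load costs input plus sort plus output, while with two disks, non-blocking I/O, and enough buffers, the steady-state cost per load is only the maximum of the three.

    # Serial vs. fully overlapped run generation, with hypothetical costs
    # per memory load (arbitrary time units, purely for illustration).
    t_in, t_sort, t_out = 5.0, 3.0, 5.0
    loads = 20

    serial = loads * (t_in + t_sort + t_out)       # stop-and-go, as analyzed above
    overlapped = loads * max(t_in, t_sort, t_out)  # slowest activity dominates
    print(serial, overlapped)                      # 260.0 vs. 100.0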
All right. Then, second, we want to be able to generate a smaller number of runs than the simple strategy we just discussed produces, and that's the same as saying we want to generate runs longer than the capacity of the memory. We'll see, perhaps not in the next lecture but in a subsequent one, that by changing our strategy we can in fact generate runs whose length, on average, is two times the size of memory. So we can reduce the number of runs by a factor of two, on average.
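The payoff of that factor of two, taking the lecture's claim at face value, is one fewer merge pass in this example:

    # Effect of doubling the average run length: fewer initial runs,
    # and hence fewer merge passes over all 10,000 records.
    from math import ceil, log2
    n, memory = 10_000, 500
    for run_len in (memory, 2 * memory):
        r = ceil(n / run_len)
        print(run_len, r, ceil(log2(r)))  # 500  -> 20 runs, 5 passes
                                          # 1000 -> 10 runs, 4 passes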
But let's go to the second phase. Yes? A question about merging, say, R1 and R10? Right, the comment is that if, instead of merging R1 and R2 together, we started by merging R1 and R10, that might have an impact on the time it takes to compare a pair of keys in the merge, and perhaps you could reduce the running time in that fashion. There are two things to keep in mind with respect to that. One is that there is no reason to believe that the keys in R1 are a whole lot different from the keys in R10 versus those in R2, because there's no relationship among R1, R2, R3, and R10: you started off with a collection of records, and there's no reason to believe that records with similar keys are closer to each other than records with different keys. We start with an unsorted sequence. So whether you do R1 and R2, or R1 and R3, or R1 and R10, statistically you don't expect any difference. The second thing has to do with the actual numbers. Suppose you had two numbers and you wanted to compare them. If the two numbers differed in the most significant bit, versus being the same in the first 15 bits and then differing in the 16th, is there a difference in how much time the compare takes? That would, of course, depend on what comparison architecture or hardware you've built and what the comparison algorithm is. Are you doing a bitwise compare in series? No computer does that; otherwise the time to do a compare would increase as you increase the word size. So even though there is variation in the time it takes to do a compare, the variation is not that large; it's not like going from 1 to a factor of 32 on a 32-bit machine.

Okay, so let's go to the second phase, which is the run merging phase. Here, too, we can think about trying to do the three things, input, output, and internal merging, in parallel, overlapping the time spent in them, rather than our simple scheme where, while you were inputting a buffer load, you could not be writing and you could not be internally merging. Certainly, to do this you're going to need a lot more buffers, and you're going to need a buffer management strategy. And, as was the case for the first phase, you're going to need the same kind of hardware: two drives, so that you can read and write at the same time, and non-blocking reads and writes; we make the assumption that the hardware supports this.

Then, the number of passes was important, and this is determined by the order of the merge. Our example did a two-way merge; we could think of doing a higher-order merge. In general, the number of passes is log base K of the number of runs if you do a K-way merge. So, for example, with our twenty runs, if we did a five-way merge, then going from R's to S's, the 20 R's would reduce to 4 S's: we merge the R's five at a time. Of course, to do this we will need five input buffers and an output buffer, in a minimal configuration using the simplistic scheme we discussed, so you've got to have enough memory for that, or you have to be able to reduce the block size. In this case we'd have just two passes, and the second pass would become a four-way merge, since you don't have five S's to merge.

Now, if you reduce the number of passes, the benefit you get isn't necessarily proportional to the reduction in the number of passes. So let's do a little bit of analysis here. As I said, if you're going to do a K-way merge, then you need K input buffers plus one output buffer, so the number of buffers goes up linearly in the order of the merge. But the size of your memory is fixed, so at some point, when you increase K beyond that point, you're going to have to reduce the block size. Let's assume your operating system lets you reduce the block size. Once you reduce the block size, the number of blocks that have to be read in during a pass goes up: if a block is smaller, the number of blocks goes up, so each pass reads more blocks, in which case there are more seeks and latencies you have to pay. So the input time per pass goes up. The number of passes may go down, but the input time per pass will go up. So you end up with a curve that looks something like this; this is the time per pass. Initially, as you go from K equals 2 to 4 to 8 to 16, you don't see any change in the time per pass, because the block size is sufficiently small that you can accommodate the increased number of buffers in your memory. But at some point you have to reduce the block size, in which case the number of blocks you read in per pass goes up, and so the input time, or output time, per pass goes up.
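To see the shape of that curve numerically, here is a toy calculation; the seek-plus-latency cost and the per-record transfer cost are invented for illustration and are not the lecture's numbers.

    # Per-pass input time vs. merge order K, for a fixed memory of 500 records.
    # A K-way merge needs K + 1 buffers, so beyond K + 1 = 5 the block shrinks.
    from math import ceil

    MEMORY, N = 500, 10_000
    ACCESS, PER_RECORD = 10.0, 0.1    # hypothetical seek+latency, transfer costs

    for k in (2, 4, 8, 16, 32, 64):
        block = min(100, MEMORY // (k + 1))     # records per block
        blocks = ceil(N / block)                # blocks read in one pass
        t_pass = blocks * (ACCESS + PER_RECORD * block)
        print(f"K={k:2d}  block={block:3d}  input time per pass={t_pass:7.0f}")
    # flat at first (K = 2, 4), then rising steeply as the block size shrinks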
So the question, then, is: we see that the I/O time per pass is going to go up after a point; what does this mean for the total time of the second phase? We need to look at the total time. The total I/O time is given by the I/O time for one merge pass times the number of passes, which is the log, base K, of the number of runs. Initially, in the previous graph, we had a flat portion: the per-pass time is flat, but the number of passes is going down, so you see a reduction in the total I/O time. After a point, even though log base K of the number of runs is still going down as you increase K, the per-pass time is going up, and the net effect is an increase in the I/O time. So you can get a reduction in total I/O time up to some critical value of K.

What happens to the internal time spent doing the merges as you increase K? A simple way to do the merge, when you go to a higher-order merge, is to just extend what you were doing for a two-way merge. In a two-way merge you would compare the first two keys and move the smaller one out. So now, say I'm doing a six-way merge: I've got six records, so I do five comparisons among the six front keys in these buffers, and the smallest one goes out. So, doing K minus 1 compares, you can find the smallest of K items and push it to the output buffer. Now consider your time to merge in a pass. If you start with n records (in our case n was 10,000), all n will end up in the output buffers. The number of comparisons needed to move a record from the input buffers to the output buffers is K minus 1, and all n records go that way, so it's (K - 1) times n compares, and the C factor there converts compares into, say, microseconds. So that's the amount of time it takes to do a merge pass internally. You multiply that by the number of merge passes, which is log of the number of runs base K; the number of runs is R, so that's log of R base K. Then you replace log of R base K with log of R base 2 divided by log of K base 2, and you end up with the expression C (K - 1) n (log of R base 2) / (log of K base 2). If you look at this, the important term is K divided by log of K base 2, which tells you that as you increase K, the internal merge time goes up; it's not going up linearly in K, but it's pretty close to that: it's K over log of K base 2.

Okay, so what we've seen is that, one, as you increase K beyond a certain point, you have to reduce the buffer size, so you want to conserve the number of buffers if you're trying to go to any sophisticated buffer management scheme; and two, as you increase K, the internal merge time, the total amount of time you spend merging, goes up, so that's not good either. Now, we can remedy this by changing the strategy used to do a K-way merge: we use a data structure called a tournament tree. In between your runs and the output buffer, you put a tree, which is used to select the next record that moves into the output buffer. We will see that when you use a tournament tree, the time required to move a record from an input buffer to the output buffer, instead of being of the order of K, becomes of the order of log K. So if you have to move n records from input buffers to the output buffer, as you do in a pass, then the time becomes D times n log K, where D is some constant factor that converts comparisons to microseconds.
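The tournament tree itself is next lecture's topic. As a stand-in with the same order-log-K cost per record moved, here is a K-way merge using Python's heapq (a binary heap, which is not the tournament tree the lecture will develop, but gives the same asymptotic bound):

    import heapq

    def k_way_merge(runs):
        """Merge K sorted runs with O(log K) work per record moved."""
        # seed the heap with the front record of each run
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            key, i, j = heapq.heappop(heap)   # smallest of the K front keys
            out.append(key)                   # move it to the output, O(log K)
            if j + 1 < len(runs[i]):
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
        return out

    # k_way_merge([[1, 5, 9], [2, 6], [3, 4, 8]]) -> [1, 2, 3, 4, 5, 6, 8, 9]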
So we get an improvement here: previously we had K minus 1 as our factor, and that drops to log of K. And what that does, when you plug it into the total time as we did before, is make the K vanish from the total time: the time per pass is multiplied by the number of passes, log of R base K; we replace log of R base K with log of R base 2 divided by log of K base 2; the log K's cancel out, and you get just a D n (log of R base 2) term. So if, instead of using the simple strategy to merge, you make use of a tournament tree, we can make the time for the internal merging portion insensitive to K. Now, you do have a penalty when you switch from the simple scheme to this scheme: if you tried this scheme with K equal to 2, there would be an overhead penalty, because the D here is bigger than the C we were using there. But once you make the switch, it doesn't matter whether you're dealing with a K of 8 or a K of 16 or a K of 32; it all takes the same amount of time.

All right, so to implement merge sort effectively, we need to have a good buffering strategy; we don't want to use more buffers than needed. The buffering strategy should support parallel input, output, and internal operations. And we want to use tournament trees to do the K-way merge. So next time, we'll start by looking at tournament trees and then see how they can be used both for run generation (they'll give us runs whose size, on average, is two times that of memory) and also for our K-way merge; and then we'll look at buffering strategies. Okay.