Transcript

Okay, all right, any questions before we start today's lecture? All right, so we're going to continue with our discussion of external sort, and we're still looking at enhancing merge sort. The part that remains is the second phase of merge sort, in which we merge runs together. We've seen that when we use the loser-tree method to generate runs, we end up with runs of different lengths, and we then use the Huffman-tree construction to figure out an optimal order in which these runs should be merged. Now we need to look at how we actually take, say, k runs, if we're doing a k-way merge, and merge them together into one run.

This is a slide we had up when we looked at the initial version of merge sort. We said that to improve the second phase we wanted to reduce the number of merge passes. At that time we were dealing with runs of the same size, so we thought about using a higher-order merge, and that translated into a number of passes given by log base k of the number of runs. Of course, now we're not going to make passes; instead we're going to use a merge tree as determined by a Huffman tree. Still, using a higher-order merge will reduce the total cost of the merge, so we'll use a Huffman tree of order k.

We of course need to select k, but once k has been selected, we want to be able to overlap the input/output with the internal merging that takes place. For that to happen we need two disks, as was the case for the similar overlapping during run generation, and we also need to organize our memory so that it's possible to perform three tasks in parallel.
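As a quick illustration of the pass count just mentioned (my own sketch, not from the lecture): with a k-way merge over equal-length runs, each pass cuts the number of runs by a factor of k, so the number of passes is the ceiling of log base k of the number of runs.

```python
def merge_passes(num_runs: int, k: int) -> int:
    """Number of passes a k-way merge makes over num_runs equal-length runs.

    Each pass merges groups of k runs into one, replacing the current runs
    by ceil(num_runs / k) longer runs, so this loop computes
    ceil(log_k(num_runs)) without floating-point logarithms.
    """
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // k)  # ceiling division
        passes += 1
    return passes

# e.g. 64 runs with an 8-way merge take 2 passes; with a 2-way merge, 6 passes
```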
We want to be able to read from one disk, and while we are reading, we want to be able to write out parts of runs that have been generated, and we want to be able to perform the internal merging that generates the run. So we want to do these three things in parallel. This is a diagram we've used in a previous lecture: down here is the synchronization step, where when one of these three activities finishes, you wait for the other two to finish, and then you proceed to the next cycle.

So, in order to do input, output, and merging in parallel, we're going to need two output buffers: one in which you generate the next block of the run, and another that holds the current block being written out. Call them output buffer 0 and output buffer 1. We might, for example, be filling up this buffer using the CPU, doing internal merging, while that one is being written out to disk, and then we interchange the two: we fill this one while writing that one out.

We also need input buffers. If you're going to do a k-way merge, you need at least one buffer load from each of the k runs, so that's k input buffers, and then another buffer into which you're reading the next buffer load, so you can keep the read active. So you need at least k + 1 input buffers. But k + 1 is really not enough: if you have just k + 1, you can construct an example showing that no matter how you fill the extra buffer, you can end up in a situation where you don't have enough data to proceed with the merge, so you have to wait for input.

It turns out that you really need at least 2k buffers to get things to work, and even for 2k to work you have to be careful about how you use them. If you take the 2k buffers and allocate two buffers per run — a static allocation — then the text has an example showing that you will still run out of input from some run and have to wait for that run to be replenished before the merge can continue. You really need a dynamic allocation of buffers, where you do some kind of prediction: based on the data currently in memory, you expect to run out of data in, say, run three first, so you bring in the next buffer load for run three. When it's time to read again, you look at the last record read from each of your runs and see which run is going to run out first — that's the one whose last record read has the smallest key — and you bring in the next buffer load from that run. So you predict which run is going to exhaust first, and whenever it's time to read another buffer load, you read from that run.

All right, so we're going to use a predictive strategy, and with it we'll be able to get by with 2k buffers. The reason we want to minimize the number of buffers, of course, is that the order of merge you can use is constrained by the size of a block — which is the same as the size of a buffer — and by the size of the memory you have. To use the highest possible order of merge, you need to use the smallest possible number of buffers. So: a static two-input-buffers-per-run scheme doesn't work, 2k buffers allocated statically are not enough, and we will instead allocate buffers dynamically, on an as-needed basis.
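The predictive choice just described — pick the run whose last-read key is smallest, with a fixed tie-breaker — can be sketched like this (my own illustration; the lecture gives no code):

```python
def run_to_exhaust_first(last_keys):
    """Index of the run predicted to exhaust first.

    last_keys[i] is the key of the last record read so far from run i.
    The run with the smallest such key must run out of in-memory data
    first; ties are broken by run index, which serves as the enforceable
    tie-breaker that the k-way merge itself must also use.
    """
    return min(range(len(last_keys)), key=lambda i: (last_keys[i], i))

# e.g. with last keys [50, 30, 70, 30], runs 1 and 3 tie; run 1 is chosen
```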
So it will be possible that at one time we have, say, k + 1 buffers assigned to one run while the remaining runs have only one buffer each; at some other time you may have two per run; at some other time some other distribution of buffers. Anything is possible. As I said, we determine which run is going to exhaust first by looking at the last record that has been read from each run: the run whose last record read has the smallest key is the one that will exhaust first. In case of a tie, we need an enforceable tie-breaker, which is also used by our k-way merge. So we can always predict which run will exhaust first, and we read from that run when the next opportunity to read arises.

To manage our buffers we employ queues. Say you're doing a nine-way merge, so k is nine; then I have nine queues of buffers, for run 0, run 1, run 2, up through run 8. Each queue can hold some number of buffers, and each has a front-of-queue pointer and a rear-of-queue pointer. In this particular configuration, run 0 has three buffer loads of data in memory, run 1 has one buffer load, run 2 has one buffer load, run 4 has two buffer loads, and run 6 has three buffer loads. The number of buffers assigned to a run varies over time, depending on need. When it's time to read a buffer load, we look at the last record in the rear buffer of each queue, see which has the smallest last key, and that is the run for which we read the next buffer load; when the read finishes, that buffer load gets attached to the end of that run's queue.

We have two output buffers. If this is the current active output buffer, then we merge from the fronts of these queues into the current active output buffer. So that's the setup: we have k queues of buffers, and some number of input buffers that are currently not in use sit in a pool of free buffers. When it's time to read a block, we take a free buffer from the pool and start reading into it; when the read finishes, that buffer is attached to the end of the appropriate queue. When you're merging, you merge from the fronts of the queues; when you finish using a buffer, you move to the next buffer in its queue. If the output buffer becomes full, you stop merging and go into a synchronization step.

To start the system, we initialize k queues of input buffers, each queue holding exactly one buffer: we read in one block from each of the k runs that are to be merged. We have a total of 2k input buffers, so of the remaining k buffers, k − 1 go into the pool of unused buffers. I set my active output buffer to be output buffer 0. That leaves one input buffer that hasn't been read into, and I initiate a non-blocking read into it from the run that is predicted to exhaust first. That's the initialization step.
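The state right after initialization might look like this (my own Python sketch; buffer loads are modeled as lists of records, runs are assumed non-empty, and the non-blocking disk read is not simulated — only the bookkeeping is):

```python
from collections import deque

def initialize(runs, block_size):
    """Build the starting merge state: k queues holding one buffer load per
    run, k - 1 free input buffers, and two output buffers.

    `runs` is a list of k sorted, non-empty lists standing in for runs on
    disk; the free pool is modeled simply as a count of unused buffers.
    """
    k = len(runs)
    positions = [block_size] * k                           # next unread offset per run
    queues = [deque([run[:block_size]]) for run in runs]   # one buffer load per run
    free_buffers = k - 1                                   # of the remaining k input buffers
    out_active, out_writing = [], None                     # output buffer 0 active, 1 idle
    # the one leftover input buffer gets a non-blocking read started into it,
    # from the run whose last key read so far is smallest (exhausts first)
    target = min(range(k), key=lambda i: (queues[i][-1][-1], i))
    return queues, positions, free_buffers, out_active, out_writing, target
```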
Then I'm going to have a method called kWayMerge, which merges from the queues of buffers into the current active output buffer. This method stops merging either when the output buffer becomes full, or when the end-of-run marker is merged into the output buffer. Assume each run ends with an end-of-run marker, say a very large key value; when that marker is merged into the output buffer, it signals the end of the merging of these k runs. So kWayMerge stops when one of two things happens: the output buffer becomes full, or you're finished merging the k runs. If neither event has occurred and an input buffer becomes empty, you advance to the next buffer in that queue and continue merging from there; when you advance, the empty buffer is put into the pool of free buffers. Questions about how the k-way merge works?

All right, so now, to merge k runs: after the initialization I have a loop that does a kWayMerge — which stops either when the output buffer is full or when the merging of the k runs is finished — followed by a synchronization step: wait for any active input or output to complete. The first time through, you only have an active input, because as part of the initialization I started a non-blocking read; in subsequent rounds both input and output will be active. After the synchronization, if an input was taking place, you have a full input buffer, and you attach it to the end of the queue for its run. You determine which run will exhaust first by looking at the last keys in all the queues. If there is more input for that run — it may be that you've already finished reading all the blocks of all k runs — you initiate a read of its next block; so long as there is more input, you initiate a read. Then I initiate a write of the active output buffer. These are non-blocking reads and writes. I switch the roles of the two output buffers, and then I repeat, and you keep doing this until you're finished merging the k runs.

To recap: first the initialization step — read in the first block from each of the k runs, put k − 1 of the remaining buffers into the free pool, and initiate a read into the last buffer using our predictive strategy. Then we repeat the following until all k runs have been merged: do a kWayMerge, which merges just one output buffer load — as soon as the output buffer is full it stops, and if you're finished with all k runs it also stops; synchronize; if an input buffer was being read, put it into its queue; determine where to read from next, and start a read if there is more data to read; write the output buffer that was just filled; switch the roles of the output buffers; and repeat, so long as you haven't merged in the end-of-run marker. Questions about how we do the merge?

All right, so this is the strategy. Now, we don't actually know yet that this strategy works, because we don't know whether 2k buffers are sufficient for the scheme. In principle, if you had an infinite supply of buffers, it would be guaranteed to perform the task.
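Before looking at where the scheme can fail, here is a compact, synchronous simulation of the whole loop (my own sketch: disk reads and writes are modeled as list slicing, so the I/O overlap itself is not simulated — only the buffer management is: k queues of input buffers, a free-buffer pool, predictive reads, and two alternating output buffers):

```python
from collections import deque

END = float("inf")  # end-of-run marker: a key larger than every real key

def buffered_k_way_merge(runs, block_size):
    """Synchronous simulation of the lecture's buffered k-way merge.

    Each cycle fills one output buffer load, then "completes" the one
    pending read, attaches that buffer to its run's queue, predicts which
    run will exhaust first, and starts the next read from it.
    """
    k = len(runs)
    data = [run + [END] for run in runs]            # append end-of-run markers
    pos = [block_size] * k                           # next block offset per run
    queues = [deque([data[r][:block_size]]) for r in range(k)]  # one block each
    free = k - 1                                     # free input buffers (counted only)
    pending = None                                   # (run, block) read in progress
    # initialization: read into the one leftover buffer, from the run
    # predicted to exhaust first (smallest last key read so far)
    t = min(range(k), key=lambda i: (queues[i][-1][-1], i))
    if pos[t] < len(data[t]):
        pending = (t, data[t][pos[t]:pos[t] + block_size]); pos[t] += block_size
    idx = [0] * k                                    # offset into each front buffer
    out, result, done = [], [], False
    while not done:
        while len(out) < block_size:                 # kWayMerge: fill one output block
            # the lecture's first proof argues queues[i] is never empty here
            r = min(range(k), key=lambda i: (queues[i][0][idx[i]], i))
            if queues[r][0][idx[r]] == END:
                done = True; break                   # all k runs merged
            out.append(queues[r][0][idx[r]]); idx[r] += 1
            if idx[r] == len(queues[r][0]):          # front buffer emptied:
                queues[r].popleft(); idx[r] = 0; free += 1  # back to free pool
        if pending is not None:                      # synchronize: read completed,
            queues[pending[0]].append(pending[1]); pending = None  # attach buffer
        t = min(range(k), key=lambda i: (queues[i][-1][-1], i))    # predict
        if pos[t] < len(data[t]):
            assert free > 0, "the second proof says a free buffer always exists"
            free -= 1
            pending = (t, data[t][pos[t]:pos[t] + block_size]); pos[t] += block_size
        result.extend(out); out = []                 # "write" the block; swap buffers
    return result
```

Running it on a few run sets produces the fully merged sequence, and the two assertions that model the potential failure points never fire, consistent with the proofs that follow.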
To see why, or where, things can go wrong, look again at the k-way merge method and the steps in it. There is a step that says: if the output buffer is not full and an input buffer becomes empty, advance to the next buffer load in that queue. Well, maybe there is no next buffer load in the queue that just became empty. That is a potential point of failure: a buffer has become empty, you move to the next buffer in the queue, and that buffer is not there. In that case our algorithm would fail. The other steps are not points of failure: you're merging from the front buffers, and you stop when the output buffer becomes full or when you've put out the end-of-run marker, and you can't fail for those reasons. So that's one point of failure, and we have to establish that whenever you come to this point in the algorithm, there is a next buffer in the queue.

Another place where things can go wrong is where we say: initiate a read of the next block. To initiate that read you must have a free buffer; if all of your buffers are sitting in queues, there's no way for you to initiate it. There has to be a free buffer in the pool for you to do this. Do you see any other potential points of failure in the algorithm we've described? You should be able to convince yourself that these are the only two: one, you don't have the next block when you need it; and two, you don't have a free buffer to read into when you need to read. So we need to prove that neither of these two conditions can arise.

Let's look at the first one: you're doing the k-way merge, a buffer in one of the queues becomes empty, and the next buffer you're advancing to is not present. What we'll show is that if this failure were to occur, then two different valid analyses arrive at different numbers for the amount of data in memory — a contradiction. By "the amount of data available to the k-way merge" we mean the data in the input buffer queues plus the data in the active output buffer; we do not include the data in the buffer currently being written out or in the buffer currently being read into, since those are not available to the k-way merge. That's what we're going to count, and we'll show that one method of counting gives one number while the other, under the assumption of failure, gives a different number, so the assumption of failure must be invalid.

So here is where we are in the code: that's the k-way merge. The first time through, we read in k input buffer loads, so the data available to the k-way merge is just the k buffer loads sitting in the queues; the first time through, there are exactly k buffer loads available to the k-way merge.
Then the next time through, we take a buffer load that is full and write it out, but we also take an input buffer whose read was started in the previous round and attach it to a queue. One buffer load goes out, one buffer load comes in, so the amount of data remains k. In more detail: we start with k; an output buffer becomes full; we come down to the synchronization and wait for the I/O to complete; we take the input buffer and put it into its queue, so the data momentarily becomes k + 1; we determine which run will exhaust first, start a read, and start a write. The output buffer that became full is now the write buffer, you switch roles, so the active output buffer is empty, and we're back to exactly k buffer loads: one was added by the completed read and one was removed by the write. So on subsequent iterations, input and output take place at the same rate — you write one block, you read one block — and the amount of data stays at k, except towards the end, when there is no more data to read and you're doing more writing than reading.

But we couldn't have reached that end state at the time of failure: if you're trying to advance to the next block, you could not yet have output the end-of-run marker, because the moment the end-of-run marker is output, the merge stops. So you haven't yet output the end-of-run marker, which means you must have been reading and writing at the same rate up to this point; otherwise, by the predictive strategy, you wouldn't be advancing. Input and output proceed at the same rate up to the time of failure, and therefore you always had exactly k block loads available to the k-way merge. That's the first line of reasoning: at the time of failure, there are exactly k block loads available to the k-way merge.

The second line of reasoning: at the time of failure, the output buffer can't be full and the end-of-run marker hasn't been merged, so you have less than one buffer load in the active output buffer. One queue has become empty. The remaining k − 1 queues cannot have a second buffer attached to them, because of the predictive strategy: before any run would get a second buffer, the run that exhausts first would have received its replenishment. So the other runs each have only one buffer, and at worst each of those buffers is full; that's at most k − 1 buffer loads in the input queues. Less than one buffer load in the active output buffer plus at most k − 1 buffer loads in the input queues adds up to less than k buffer loads. The previous reasoning said exactly k buffer loads; this reasoning says less than k buffer loads; so I have a contradiction, and the assumption of failure must be incorrect. We cannot have a failure of this type when we use the predictive strategy. Any questions about that proof? As I said, the proof makes use of the predictive strategy, and that's crucial to not running out of data.
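The two counts can be written compactly (my own restatement of the argument, with D denoting buffer loads of data available to the k-way merge at the assumed moment of failure):

```latex
% Two counts of the data available to the k-way merge at the assumed
% moment of failure (advancing to a buffer that is not there):
\begin{aligned}
\text{reads and writes at the same rate:}\quad
  & D_{\mathrm{avail}} = k \ \text{buffer loads};\\[2pt]
\text{one queue empty, no backup buffers:}\quad
  & D_{\mathrm{avail}} < \underbrace{1}_{\text{active output}}
      + \underbrace{(k-1)}_{\text{other queues}} = k .
\end{aligned}
```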
A question came up about this count. This one here — right, so by definition, the total data available to the merge is the data in the active output buffer plus the data in the input queues; yes, the output buffer is being taken into consideration. The output buffer is not full, so there's less than a buffer load there. You have k queues; one has become empty — that's the one that caused the failure — and the other k − 1 queues can have only one block each, because you couldn't have brought in a backup block for any of them, given the predictive strategy. So the other k − 1 queues have at most k − 1 block loads, and adding the two together you get less than k.

All right, let's look at the second point of failure, over here, where we initiate a read for the next block; for this one we'll make use of the fact that we have 2k buffers. We're postulating that we don't have a free buffer to read into, which means that all 2k of our buffers are sitting in queues at this time. So suppose there is no free input buffer. Again we'll perform two different analyses based on this assumption. One will show that there are exactly k + 1 buffer loads in memory — not data available to the k-way merge this time, but the total amount of data in memory at the time of failure — and the other will show that there are more than k + 1 buffer loads, again arriving at a contradiction, so the assumption that there's no free input buffer must be incorrect. (How does that show it? Well, we haven't done the two analyses yet — let's do them.)

Here is where we are in the program. At this point, no read or write is taking place; everything has finished; we're done with the synchronization that took place back here, and the buffer that was being read into has been put into a queue. Also, up to this time, input and output must have been proceeding at the same rate: if they hadn't, that would mean you had stopped doing input at some point, and you stop doing input only when you've read in all the data — in which case you wouldn't be trying to read data now. So up to this point, input and output have been going at the same rate. The first time you get here, k blocks of data had been read in during initialization and we read one more block out here, so there are a total of k + 1 blocks of data. Then you initiate a write and initiate a read; one block gets written, one gets read; you come around again, you synchronize, and you still have k + 1. So up to the point of failure, there are exactly k + 1 blocks of data in memory when you reach this line. Any questions about that?

All right, the second analysis. At the time we come to this line, the active output buffer is full — there is still more data, we haven't merged the end-of-run marker, so we can't have a partial output buffer. Now focus on the input queues. It's possible that one of the queues has an empty buffer at its front that we haven't yet advanced past, because our k-way merge strategy says that as soon as the output buffer becomes full you stop, and the output buffer may become full at exactly the moment an input buffer becomes empty — in that case we don't free that input buffer. So it's possible that one queue has an empty input buffer at this time, but it's not possible that two queues do. So: one queue possibly has an empty first buffer, and the remaining k − 1 input queues have non-empty first buffers. In addition to these k first buffers, there are k other buffers sitting in the queues — second, third, fourth buffers — and those have to be full buffers.

So if you add it up: we've got k full buffers there and one full output buffer, that's k + 1, and since k is at least two, you've also got at least one non-empty first buffer, so you have more than k + 1. To be clear, I'm not saying that the k − 1 input queues are empty or non-empty in those two statements; I'm making a statement about the first buffer in each of the k queues. Each queue has at least one buffer in it, and potentially more. Looking only at the first buffer in each of the k queues: one of these first buffers may be empty, and the other k − 1 first buffers cannot be empty. In addition to these k buffers, since I have a total of 2k input buffers in my system, there are another k buffers; none of them is in the pool of free buffers, so these k must be in the queues, and if they're in the queues they have to be backup buffers, which must be full. That's where this line comes from: the remaining k input buffers are backup buffers and have to be full, so you've got k full buffers in the queues. That's why these are the k non-first buffers — we just don't know which queues they're in.
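Again the two counts can be summarized compactly (my own restatement; ε > 0 stands for the data in the at-least-one non-empty first buffer, which exists because k ≥ 2):

```latex
% Two counts of the TOTAL data in memory at the assumed moment of
% failure (no free input buffer, so all 2k buffers sit in queues):
\begin{aligned}
\text{reads and writes at the same rate:}\quad
  & D_{\mathrm{mem}} = k + 1 \ \text{buffer loads};\\[2pt]
\text{all } 2k \text{ buffers in queues:}\quad
  & D_{\mathrm{mem}} \ge \underbrace{1}_{\text{full output}}
      + \underbrace{k}_{\text{full backup buffers}}
      + \varepsilon \;>\; k + 1 .
\end{aligned}
```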
So I've got these k non-first buffers that are full; I've got the output buffer, which is full — that's k + 1; the one empty first buffer doesn't do me any good; and I've got at least one first buffer that is not empty. So I've got at least one record more than k + 1 buffer loads — more than k + 1 — and that contradicts what I had on the previous slide, which said there are exactly k + 1 buffer loads. So if you have 2k buffers, you cannot have this kind of failure. In summary: if you use the predictive scheme, you can't get the first kind of failure, and if you have 2k buffers, you can't get the second kind of failure — because here, where we want to initiate a read of the next block, the only way this can fail is if the pool is empty, and that is exactly the assumption, no free input buffer, that we've just contradicted. Other questions? All right, so the predictive scheme together with 2k buffers is sufficient to make our merge strategy work.

Now, to minimize the wait time, what you would like, of course, is for the time to fill an output buffer to be roughly equal to the time to read a buffer load, and roughly equal to the time to write a buffer load. In practice you don't have much control over this: your block size is determined by the disk format, and the time to fill a buffer is determined by your CPU speed. Even changing k doesn't really change the fill time, because if you use a loser tree, we saw that the time to merge is insensitive to the order of the merge. So this is what you'd like to have, but you don't have a whole lot of control over the parameters to make those times equal.

Typically you're going to be performing many k-way merges — at each internal node of the Huffman tree you do a k-way merge. The way we've written the merge, we first have to read the first block of each of the k runs being merged, and during that time you're only doing input; you're not overlapping input with any output or internal merging. You can arrange things so that for subsequent k-way merges you read the first blocks of the runs early, so you don't have this startup period. For that, you change the if statement in our merge loop: if there is more input from the current run, then you certainly go and get its next block; but if there isn't, that means you've read in all of the records — all of the blocks — for the current k-way merge, because with the predictive scheme, if there's nothing left from the run that's going to exhaust first, there's nothing left from any of the other runs. In that case you initiate a read of a block for the next k-way merge. That way, only the very first k-way merge you do has the startup cost of reading k blocks while you aren't doing any writing or merging.

All right, any questions? All right, we'll stop here then.
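The modified read decision could be sketched as follows (my own illustration; the `next_merge_first_blocks` queue is a hypothetical stand-in for the first blocks of the runs at the parent Huffman-tree node):

```python
def choose_next_read(exhaust_run, blocks_left, next_merge_first_blocks):
    """Decide what the next non-blocking read should fetch.

    exhaust_run: index of the run predicted to exhaust first in the
        current merge.
    blocks_left[r]: unread blocks remaining on disk for run r of the
        current merge.
    next_merge_first_blocks: list of not-yet-read first blocks of the runs
        of the NEXT k-way merge (hypothetical bookkeeping).

    Mirrors the modified if statement: read for the current merge while it
    has input; once it is fully read (the predictive scheme guarantees no
    other run has input left either), prefetch first blocks for the next
    merge, so only the very first merge pays the k-block startup cost.
    """
    if blocks_left[exhaust_run] > 0:
        return ("current", exhaust_run)
    if next_merge_first_blocks:
        return ("next", next_merge_first_blocks.pop(0))
    return ("none", None)
```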