Transcript

Okay, in the last lecture we saw how to generate runs whose size was larger than the size of memory and potentially could be very large; you could end up with a single run, independent of the number of records you were trying to sort, provided they were reasonably close to being in sorted order. Now, that method resulted in runs of variable length, and once you have variable-length runs, the simple strategy of just making multiple passes over the runs to merge them doesn't yield an optimal merge strategy. So today we want to look at how we might figure out the optimal way to merge a collection of runs of different lengths.

This is the example we saw last time. Basically, the tree here represented the multipass scheme, and its cost, as we saw last time, was 44; the other one represented a non-multipass scheme, and that had a better cost of 42.

All right, so to figure out the best way to merge a collection of runs, we'll introduce a term called the weighted external path length, which is a property of a binary tree. Then we'll see that the weighted external path length of a binary tree actually gives us the cost of using that binary tree for merging together a collection of runs, so what we want to do is minimize the weighted external path length of the merge tree being used.

Okay, so if you think back to the previous slide, the multipass merging was described by one binary tree, and then we had a different scheme described by the binary tree on the right. Really, any binary tree that you come up with describes a merging pattern, so our objective becomes one of finding the right binary tree, and the parameter that you want to
optimize on that binary tree is something called the weighted external path length.

All right, so here is how the weighted external path length of a binary tree is defined. You have a binary tree; wherever a node would otherwise have a missing child, you put an external node, and the external nodes have weights. You take the weight of an external node, multiply it by the distance of that node from the root of the tree, and add that up across all external nodes; we call that sum the weighted external path length.

So, for example, the four external nodes here are shown in white, and their weights are 4, 3, 6, and 9; think of those as the lengths of runs. The weighted external path length of this binary tree will be 4 times the distance of that node from the root, and this one is two hops from the root, so 4 times 2, plus 3 times 2, plus 6 times 2, plus 9 times 2. That works out to be 44, which was also the merge cost of this tree, and that's not just a coincidence. The cost of this internal node is 7 as far as merging goes, the cost of that node is 22, and the cost of the other one is 15; add up 7, 15, and 22 and you get 44. The reason that sum equals what you get the other way is that these 4 records out here are merged at this node and they're also merged again at that node: these four records are read in here, read in there, merged here, merged there, written out here, written out there, so 4 times 2 is their contribution to the total cost. Similarly, these three records are read in twice, written out twice, and merged twice, so you take 3 times 2. That's why, when you take the weight of each external node, multiply it by its distance from the root, and add
them up, you get the same thing as when we said over here there's a cost of 7, there's a cost of 15, and there's a merge cost of 22. So the weighted external path length is exactly equal to the merge cost.

Look at the other tree; the same thing holds: 4 times its distance from the root, and this one is three hops from the root, so 4 times 3, plus 3 times 3, plus 6 times 2, plus 9 times 1, and that works out to 42, again the merge cost of that tree. Again, that's not a coincidence, because these four records will be read in once here, again here, and again here, so three times, and they're written out three times and merged three times; these nine are read in only once, at this node, written out once, and merged once, so it's 9 times 1. All right, so does everybody see the connection between the weighted external path length and the merge cost of a binary tree? Also, every binary tree defines a merge pattern, so every binary tree that has four external nodes will define a possible way to merge four runs together. Any questions about that?

All right, so our objective becomes this: you are given the weights of the external nodes. You run the first phase of merge sort, the run-generation phase, and let's suppose it produces 83 runs. Following that, you know the lengths of the runs, so you're given the number of runs, 83, and you're given their weights, and you want to find a binary tree with minimum weighted external path length; it's got to have 83 external nodes, and the weights of those nodes are known to you. That's the problem we want to solve.

Okay, before looking at a solution to the problem, let's take a look at a couple of other applications for binary trees with minimum weighted external path length.
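Before moving on, the equivalence just described can be checked with a small Python sketch. The nested-tuple tree encoding here is my own, not from the lecture: a tree is either a leaf weight or a pair of subtrees.

```python
# Weighted external path length (WEPL) and merge cost of a binary tree.
# A tree is either a leaf weight (int) or a pair (left, right).

def wepl(tree, depth=0):
    """Sum of leaf weight * distance from root, over all external nodes."""
    if isinstance(tree, int):                  # external node
        return tree * depth
    left, right = tree                         # internal node
    return wepl(left, depth + 1) + wepl(right, depth + 1)

def merge_cost(tree):
    """Total merge cost = sum of internal-node weights (sum of leaves below)."""
    def total(t):                              # subtree weight = sum of its leaves
        return t if isinstance(t, int) else total(t[0]) + total(t[1])
    if isinstance(tree, int):
        return 0
    return total(tree) + merge_cost(tree[0]) + merge_cost(tree[1])

# The two merge trees from the slide, with run lengths 4, 3, 6, 9:
multipass = ((4, 3), (6, 9))      # balanced tree
skewed = (((4, 3), 6), 9)         # skewed tree

print(wepl(multipass), merge_cost(multipass))  # 44 44
print(wepl(skewed), merge_cost(skewed))        # 42 42
```

The two numbers agree on both trees, which is exactly the equivalence the lecture argues for.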
One application we'll look at is message coding and decoding, and the other is lossless data compression.

All right, so suppose you want to send messages. Here I've got a collection of n messages, labeled m0, m1, and so on. The messages are known ahead of time; these are standard boilerplate messages, and they don't change over time. So somebody comes to you and says, send message 3 to my friend out there, or send message 6 and message 8, and at the receiving station they know the content of a message just from knowing its index: if I know message 8 came, I know what it is, and I can print out the full text. So the messages are known, they don't change over time, and the transmitter and receiver, given the index, know what the message is. So you don't really transmit the message, which may be long, maybe several kilobytes or megabytes; instead you transmit a code, which is short, and then for the user at the other end the full message can be printed out.

All right, so our objective is to come up with the codes that should be used, and to come up with the codes we assume that we know the frequency with which each message is going to be sent. Some messages are sent very infrequently, some are sent very frequently, and we know the frequency, or relative frequency, with which messages will be sent, and we want to develop the optimal codes to use. Optimal here means you want to optimize the transmission cost and you want to optimize the decoding cost.

All right, suppose you have four messages and the frequencies are 2, 4, 8, and 100. Now, you could use two-bit codes, so that each message has a code of the same length; we get four codes: 00, 01, 10, and 11. If you transmit 00, the receiver knows it was the first message; if you transmit
11, the receiver knows it's the last message.

The transmission cost is the length of the code times the frequency of the message, or, the way I've got it here, frequency times length. So the first message is sent with a frequency of 2 and its length is 2, so it contributes 4; the second one is sent 4 times with a length of 2, so that's 8 bits of transmission; the last one is sent 100 times at a cost of 2 bits per transmission. The total transmission cost works out to be 228 if you use this code.

Okay, to do the decoding at the receiving end we'll make use of a binary decode tree, which looks like this. You look at the first bit in the code: if it's a zero you move to this side, and if it's a one you move to that side. Then you look at the second bit in the code: if it's a zero you come here, otherwise, if it's a one, you go over there. So with 00 you get m0, with 01 you get m1, with 10 you get m2, and with 11 you get m3.

All right, so the cost of decoding message m0 is 2, because I've got to follow two branches in my decode tree; to decode m1 is also a cost of 2, since you've got to follow two branches, and the same for each of the others. So the decoding cost is the frequency times the distance from the root. The decoding cost will always be equal to the transmission cost, and it will always be equal to the weighted external path length of the decode tree. So to find optimal codes you need to find an optimal decode tree, by which I mean a binary tree whose weighted external path length is minimum. All right, so our problem of merging runs of variable lengths optimally is identical to the problem of finding optimal codes. Any questions about that?
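The cost computation above can be sketched in a couple of lines. Since a code's length equals its leaf's depth in the decode tree, this sum is also the weighted external path length of that tree (a sketch; names are my own):

```python
# Transmission cost (= decoding cost) of a code:
# sum over messages of frequency * code length.

def transmission_cost(freqs, codes):
    return sum(f * len(c) for f, c in zip(freqs, codes))

freqs = [2, 4, 8, 100]
fixed = ["00", "01", "10", "11"]        # equal-length 2-bit codes
print(transmission_cost(freqs, fixed))  # 228
```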
An observation here: just like every binary tree with n external nodes could be used to merge n runs, every binary tree with n external nodes defines a code pattern for n messages. You just label the edges 0 and 1, depending on whether each is a left edge or a right edge, and you read the labels on the way down; that gives you the code to use. The decoding cost is again the same kind of thing: you take the weight of the node, that is, the frequency, times its distance from the root, and that's its contribution to the total decoding cost. In this case it works out to 144, the same as the transmission cost, the weighted external path length.

All right, let's look at another example. This time I've got ten messages, m0 through m9. Again, I can figure out the codes by just following the path from the root to any message and reading off the labels: 011 is the code for m5, and 0010 would be the code for m2. Just follow the path from the root to the external node, and that gives you its code.

Now, the reason that every binary tree defines a usable code is that when you use a tree, no path to an external node is a prefix of another path to an external node. So, for example, we have a path out here which is 1 1, but there's no other path from the root to an external node which is 1 1 followed by something. That property is important in decoding, because when the messages are sent you're sending long strings of binary bits: when you send m9 you send 1 1, but then immediately after that you might send m0, so you'll send 0 0 0 and another 0, and there's no marker in between those two messages. So by looking at 1 1, you follow the decode tree, you do a 1 1, you end up out here, and you know it's m9, and
then 15:17 you know that the next bit defines the 15:18 start of the next message and you can 15:24 make that conclusion only if no code is 15:26 a prefix of another prefix 15:30 so when you're going to variable length 15:32 prefixes in order for things to work and 15:35 you're sending streams of long streams 15:38 of codes one after the other no message 15:41 or no proof no code should be a prefix 15:43 of another code and using trees to 15:47 define the codes guarantees that we'll 15:53 see that when we look at the lossless 15:55 data compression example how we make use 15:57 of that property 15:59 all right so in lossless data 16:01 compression you may for example have a 16:04 long piece of text it says composed of 16:07 characters and you want to read write 16:12 that piece of text using a smaller 16:16 number of bits so for example suppose we 16:22 have text that's made up of four 16:24 characters only a B C and D and let's 16:33 suppose you have a string that has ten 16:36 A's and at five B's in it 100 CS and 16:39 it's got 900 DS 16:47 now I could write my string since there 16:50 are four characters I could say well I 16:52 can get by with two-bit codes that will 16:55 give me four different codes so use 16:57 codes of the same length and when codes 17:02 are the same length it doesn't matter 17:03 which code is assigned to which 17:05 character the frequency doesn't play a 17:07 role because all their lengths are the 17:08 same okay so let's say a is represented 17:13 by 0 0 B by 0 1 C by 1 0 and D by 1 1 17:16 then the length of the string using this 17:20 coding becomes there are 10 A's serves 17:24 10 times 2 5 B's 5 times 2 100 C's 100 17:28 times - 900 DS 900 times 2 so it takes 17:33 you 2030 bits to represent this strength 17:38 plus of course you need to have a code 17:41 table which Maps the codes to characters 17:45 okay so you have a code table with 4 17:47 entries ABCD okay Oh perhaps in this 17:52 case it could even be implicit right we 17:59 
Now suppose instead you use a variable-length code with the property that no code is a prefix of another. The code I'm going to use is: a is represented by 3 bits, 000; b is also 3 bits, 001; c is a 2-bit code, 01; and d is a 1-bit code, 1. Here you can see that no code is a prefix of another, and this code actually came from a binary tree.

All right, so the size of the string is going to be: there were 10 a's, 5 b's, 100 c's, and 900 d's, so 10 times 3, plus 5 times 3, plus 100 times 2, plus 900 times 1. That works out to 1,145 bits, plus the size of the code table.

Okay, so ignoring the size of the code table, the difference between the two schemes affords us a compression factor of about 1.8; using about half the space, I can represent this string. Now, this of course is a useful representation only if from this binary string you can recover the string you started with. You could take any input string and represent it with zero bits, just throw it away, but then you can't recover what you started with, and you have to be able to recover what you started with to have a reasonable compression scheme. So you have to be able to decompress and get back what you started with, and for that I'm going to use this decode tree.

[Student question.] Right, the 1.8 is really a function of the frequencies of the characters. And so, to come up with the proper codes, do you have to examine the data? That's right, the code will be data dependent. If you look at lossless compression schemes, the most common one would be something like the LZW scheme, where the codes are developed on the fly: you examine the string left to right and you develop the coding. It's very data dependent.
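The arithmetic for the two schemes can be checked with a short sketch (variable names are my own):

```python
# Encoded size of the example string under both codes.
freq = {"a": 10, "b": 5, "c": 100, "d": 900}

fixed    = {"a": "00",  "b": "01",  "c": "10", "d": "11"}   # 2 bits each
variable = {"a": "000", "b": "001", "c": "01", "d": "1"}    # prefix-free

def bits(code):
    """Total bits to encode the string: frequency * code length per character."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(bits(fixed))                             # 2030
print(bits(variable))                          # 1145
print(round(bits(fixed) / bits(variable), 2))  # 1.77, roughly the 1.8 factor
```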
So you're right, the code is data dependent, and the only reason for representing d by one bit was that it occurs so many more times than any other character.

All right, so to do the decoding, you build a binary tree based on the code: if you see 000 it's an a, if you see 001 it's a b, if you see 01 it's a c, and if you see a 1 it's a d. So let's suppose you're given this string. You started off with a's, b's, c's, and d's, then you compressed it and ended up with a binary string, and now you have to recover the original string. Okay, so I'll go through this and walk down my decode tree. I have a 0, 0, 0, so I end up here; these first three zeros give me an a. Then I pop back up to the root; I see a 1, so that gets me a d. Then I see another 1, and that gets me a d. Then I see a zero, so I come here; I see another zero, I come here; a 1 gets me to another external node, so that gets me a b. Then I see a zero and then a one, and I get a c. So every time you reach an external node you get a character, and then you pop back up to the root. There are no boundaries in the binary string between the different codes, and the fact that no code is a prefix of another ensures that you can do the reverse transformation.

Okay, so to maximize the compression, given a string, we'd like to have a binary tree with minimum weighted external path length, and based upon that we will develop the codes for the different characters in the text that we are compressing. Any questions about lossless data compression using this strategy?
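The decoding walk just described can be sketched as follows. The dict-of-dicts tree encoding is my own choice; the lecture only assumes some tree structure:

```python
# Decoding a prefix-free bit string by walking a decode tree.
# Internal nodes are dicts keyed by '0'/'1'; external nodes are characters.

def build_decode_tree(codes):
    root = {}
    for ch, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = ch               # external node holds the character
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):         # reached an external node
            out.append(node)
            node = root                   # pop back up to the root
    return "".join(out)

codes = {"a": "000", "b": "001", "c": "01", "d": "1"}
# The walkthrough from the lecture: a, d, d, b, c.
print(decode("0001100101", build_decode_tree(codes)))  # addbc
```

Note there are no separators in the input bit string; the prefix-free property alone is what makes each reset to the root land at a message boundary.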
[Student question.] The question was about the difference between how you develop the code and how you implement the actual compression and decompression. Okay, so what I'm trying to get across here is that you first have to have that table which maps from characters to codes, and how do you get the codes? The way you get the optimal codes is by thinking in terms of a minimum weighted external path length tree. Now, whether you actually implement the decompression using a binary tree or some other data structure is a different story.

[Student question.] Yes, it's not limited to binary codes; you can have ternary codes, or codes of any degree. You would again be looking for minimum weighted external path length trees.

[Student question.] Is the standard shape of binary trees most efficient? What do you mean by the standard shape? Okay, a balanced binary tree. All right, a balanced binary tree for this particular example would be bad, because the frequencies are so lopsided, but if all the frequencies were the same, then the balanced binary tree would be the best way to go. So it really depends on the frequencies with which the different letters appear, or on the weights of the runs, as we saw with merging runs: if all the runs are the same length, then you use a merge tree which looks like a full binary tree; it's only when the run lengths are very different that you end up using a skewed binary tree. Is that what you were looking for? Okay. Other questions?

All right, so trees that have minimum weighted external path length are called Huffman trees, and the corresponding codes are called Huffman codes. You've probably heard of Huffman codes; they are extensively used in a variety of applications, and we just looked at three of them here.

All right, so again we want minimum weighted external path length, and Huffman trees don't
have to be binary, they can be of any degree. If you're trying to find binary trees with minimum weighted external path length, you can do that using a greedy algorithm, which we'll look at in a moment. When you want to go to higher-degree trees, you need a preprocessing step, and then you run the greedy algorithm after that; that gets you the tree with the best weighted external path length.

Okay, let's look at the greedy algorithm for binary trees. Basically, if you're thinking about merging, what we're trying to do is this: we have a bunch of runs, and I want to merge the runs together two at a time. So if I gave you some number of runs, say six, and I said merge them together two at a time, and you do it in a greedy fashion, most likely what you would do is first pick the two runs of minimum length out of those six and merge them together; that gives you a new run, which you put back in your collection, so now you've got five runs, and you repeat this process until you're left with one run. Each time you have to decide which two runs to merge, you make the decision in a greedy fashion: find the two runs of smallest length, because merging them will cost you the least, merge them, and put the result back in the collection of runs.

All right, so that's the greedy algorithm. We want to build a tree, so we start with as many trees as you have runs, or weights; each tree has only an external node. That represents the initial configuration of standalone runs. Then we're going to reduce the number of trees in our collection by one by performing a greedy step, and in this greedy step we select the two trees that have the smallest weights and combine them together. So perhaps you do a merge, or, if you're building the tree, that means you take an
internal node and make those two trees subtrees of that internal node. We give the newly created tree a weight that's the sum of the weights of the two trees that were combined, and we put the new tree back in the collection of trees. So this reduces the number of trees by one: pick two trees of minimum weight, combine them into one tree by adding a new internal node, make the two trees subtrees of it, set the weight of the new tree to the sum of the weights of the two trees you just combined, and put it back in the collection. This reduces the number of trees by one, and then you repeat this reduction step until you're down to one tree.

Let's see an example. Say we have five weights; maybe you've got five runs, and the lengths are 2, 5, 4, 7, and 9. So I start with a collection of five trees; each tree has an external node only, and the weight of each tree is one of the weights you started with. In the first reduction step, I pick two trees of minimum weight; those will be the 2 and the 4. Take them out of the pool and combine them: you get a new internal node, you give it a weight which is the sum of the weights of the two trees, and you put it back in the collection. So I'm down from five trees to four. Then I pick two trees of minimum weight; those will be the 5 and the 6. Take them out and combine them: the 5 and the 6 are combined under a new node, you get a new tree whose weight is 11, and you put it back in the collection. Then you pull out two trees of minimum weight, this time the 7 and the 9; combine them and put the result back in the collection, so the new weight is 16. Then you pull out two trees of minimum weight; there are only two left, the 11 and the 16. Combine them, and you get a tree
of weight 27.

At this point you're done. You have a binary tree whose external nodes have the given weights; the weight of each internal node corresponds to the cost of the merge at that node, and the claim is that this binary tree has the minimum weighted external path length amongst all binary trees possible for the given set of weights.

Okay, any questions about the algorithm? [Student question.] So, given these weights, you mean the time required to find the one with the minimum weight? We'll look at the implementation of the algorithm in a moment, so let's not worry right now about how much time it takes to run this thing. At this point the question is really, at the concept level, do you understand how the algorithm works?

All right, once you come up with an algorithm such as this, there are two questions you need to address. One is: does this algorithm really work? Does it really get you binary trees that have minimum weighted external path length? At this point you have no reason to believe that it does; there's nothing in the construction to suggest that it really does. So you have to prove that it works. We're not going to do the proof in class, but basically you can prove this by induction on the number of weights that you start with, showing that the optimal tree for these five weights will always have a subtree with the 2-4-6 configuration in it; that this has to be a configuration in every optimal tree for the given five weights. So you can prove the whole thing by induction, where in the induction step you simply need to prove that the two trees that have minimum weight at that time must be combined together; otherwise the whole tree cannot be optimal.
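The greedy construction can be sketched with a heap. This version computes only the total merge cost, which equals the weighted external path length (a sketch of the algorithm just described, not the lecture's own code):

```python
import heapq

# Greedy two-way (binary Huffman) merging: repeatedly combine the two
# smallest weights; the sum of all combined weights is the minimum
# weighted external path length.

def huffman_merge_cost(weights):
    heap = list(weights)
    heapq.heapify(heap)                  # linear-time initialization
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)          # two delete-mins
        b = heapq.heappop(heap)
        total += a + b                   # cost of this merge / internal node
        heapq.heappush(heap, a + b)      # insert the combined tree's weight
    return total

# The lecture's example: intermediate weights 6, 11, 16, 27.
print(huffman_merge_cost([2, 5, 4, 7, 9]))  # 60
```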
Now, once you establish that the algorithm is correct, you move to the next step, which is: what data structures do you need to implement it, how much time is it going to take, and is that a reasonable, acceptable amount of time? So let's go to the next step: how much time does this thing take?

To figure out how to implement it: we have a collection of weights, and I need some data structure to represent this collection, and the right data structure to use depends on what operations will be performed on it. At the start of the algorithm I need to initialize this data structure to have n entries in it. Then, in the reduction step, I need to extract two trees with minimum weight, so that's two delete-min operations on this collection. Then, when I combine, I add the two weights and the new tree gets put back in, so I need to put a new weight into the structure; that's an insert. So I need to initialize, I need to do delete-mins, and I need to do inserts, and a good structure for that is a min heap. With a min heap you can initialize in linear time, and each delete-min takes log n time. The number of rounds to bring the collection down to one tree is n minus 1, so there are about 2(n - 1) delete-min operations; that's O(n log n). The number of inserts is n minus 1, and each insert takes log n time, so that's n log n for the inserts. Adding up everything, it's n log n, so it's a fairly efficient algorithm.

[Student question.] Yes, you can make it run faster by a constant factor: instead of doing two removes and an insert, you do one remove, and for the second one you do an increase-key operation.
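That optimization maps nicely onto Python's heapq, where heapreplace pops the minimum and sifts a new key down in a single pass (a sketch; the lecture describes the idea, not this code):

```python
import heapq

# Constant-factor speedup: replace the second delete-min plus the insert
# with a single change-key at the root. heapq.heapreplace does exactly
# that: it pops the minimum and pushes a new item in one restructuring.

def huffman_merge_cost_fast(weights):
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)              # one delete-min
        combined = a + heap[0]               # a + current second-smallest
        total += combined
        heapq.heapreplace(heap, combined)    # change the root's key, sift down
    return total

print(huffman_merge_cost_fast([2, 5, 4, 7, 9]))  # 60
```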
Once you do one remove, the second minimum item pops up to the root; you increase it by the weight of the first one you removed, and then you restructure the heap, and that should improve the performance by some small factor. So instead of doing a second delete and an insert, you do a change-key operation: let's say the first one you took out had a weight of 2 and the second one had a weight of 4; instead of taking the 4 out, I change the 4 to a 6 and restructure the heap from the root going down. But whichever way you go, asymptotically it's n log n. Any questions about the implementation?

Okay, next, suppose you are interested in higher-order Huffman trees; maybe you want to do a sixteen-way merge, or a twenty-way merge, or a sixty-way merge. Then you want to find a tree with a much higher degree that has minimum weighted external path length. If you simply run the greedy algorithm, it doesn't work. The greedy algorithm extended to higher degree would be: in the reduction step, if you're doing a three-way merge, you pick the three runs with minimum length, merge them, put the result back in, and repeat this process until you get down to one run. So try it out here, where I have four runs, 3, 6, 1, and 9. In the first step I pick the three of minimum length, so that's the 3, the 6, and the 1; their combined weight is 10, and I put it back in my collection. Now I've got a 10 and a 9; I try to pull out three, but there are only two, so I pull out what's left, just the 10 and the 9, and that last merge has a cost of 19. That gives me a total cost of 29.
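The failing naive extension can be sketched like this (the function name is my own):

```python
import heapq

# Naive extension of the greedy algorithm to k-way merging: always take
# the k smallest runs, or whatever is left. On the lecture's example it
# is not optimal.

def naive_kway_cost(weights, k):
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
        total += sum(group)
        heapq.heappush(heap, sum(group))
    return total

# Merges (1+3+6)=10, then only two runs remain: (10+9)=19.
print(naive_kway_cost([3, 6, 1, 9], 3))  # 29
```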
But if I'd done it this way, doing a two-way merge of the 3 and the 1 out here first, followed by a three-way merge of the resulting 4 with the 6 and the 9, that has a total cost of 23. So the greedy scheme didn't really work here.

You can probably figure out why it didn't work. The reason is that we weren't able to really do a three-way merge at that last step; we ended up doing a two-way merge, and a two-way merge can be viewed as a three-way merge which involves a run whose length is zero. So really, when you do a lower-order merge somewhere, it's like doing a higher-order merge with some number of empty runs. When you view it this way, you suddenly see that this is not what the greedy algorithm would have done had it known you had a run of length zero: if you had known you had a run of length zero, you'd have first merged the 0, the 1, and the 3, so those would be the ones sitting at the bottom, and if you do that, then you can prove that the greedy algorithm works. So the problem is that you started off with an incorrect number of runs; you need to put in a bunch of zero-length runs and then run the greedy algorithm.

So then you need to figure out how many runs of length zero to start with. Basically, you want to start with a total number of runs such that every time you run the reduction step, if you're doing a three-way merge, you have three runs out there, and in the end it goes from three to one. Or if you're doing a ten-way merge, there should always be at least ten runs: you pull out ten, you merge, you put one back, and at the very end you pull out ten, put one back, and there's only one left. You should never find yourself doing a low-order merge.
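Padding with zero-length runs can be sketched as below; the dummy-run count uses the formula the lecture derives next, and the code itself is my own sketch:

```python
import heapq

# Pad with zero-length runs so every step is a full k-way merge. The
# padding count q is the smallest q >= 0 making r + q - 1 a multiple of
# k - 1; in Python that is (1 - r) % (k - 1).

def kway_huffman_cost(weights, k):
    r = len(weights)
    q = (1 - r) % (k - 1)                 # number of dummy (length-0) runs
    heap = list(weights) + [0] * q
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(k)]   # always a full k
        total += sum(group)
        heapq.heappush(heap, sum(group))
    return total

# With one dummy run: merges (0+1+3)=4, then (4+6+9)=19.
print(kway_huffman_cost([3, 6, 1, 9], 3))  # 23
```

Note that `%` on a negative operand in Python returns a non-negative result when the modulus is positive, which is exactly what the compact form of the formula needs.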
All right, so let's say you're doing a k-way merge; you're trying to build a k-ary tree. Suppose that you start with a total of r runs. Then we're going to add some number of runs; call the number of runs we're going to add q, and that's a non-negative number: it may be zero, or it could be more. Each time we do a k-way merge, the number of runs decreases by k minus 1; if you do a six-way merge, you pull out six runs and put one back in, so the reduction is by five. Now suppose we repeat this process s times; then the total reduction in the number of runs is s times (k minus 1). I started with r runs, I added q of length zero, and I ran my reduction step s times, so I'm left with r + q, which is what I started with, minus s(k - 1), and I want that number to be 1 for some integer value of s.

All right, so I want r + q - s(k - 1) = 1 for some integer value of s; that's the same as saying that r + q - 1 is some multiple of k - 1.

So the first thing we conclude from this is that the number of runs you need to add is never more than k - 2: starting with an integer number of runs, you have to add at most k - 2 zeros to get to the next multiple of k - 1. If r - 1 is already a multiple of k - 1, then you don't have to add anything, so in that case q is 0. And if r - 1 is not a multiple of k - 1, then you take the remainder of r - 1 divided by k - 1, and I need to add enough to bring it up to k
minus one so 45:07 I do K minus one minus that remainder so 45:12 K minus one minus the remainder of R 45:15 minus one divided by K minus one that's the 45:18 number of runs that you need to add okay 45:25 all right so the way you determine the 45:27 number of runs to add is you take K 45:29 minus one and subtract from that the 45:31 remainder of R minus one divided by K 45:33 minus one if you add that many runs of 45:36 length zero and then you run the greedy 45:39 algorithm you will never run short of 45:41 runs until you get down to the last step 45:43 where you end up with one run now that formula is 45:49 mathematically equivalent to writing it 45:51 this way the number you need to add 45:56 is one minus R mod K minus one now one minus R 46:00 will be a negative number since R is 46:02 bigger than one and so you'd be taking the 46:08 mod of a negative number in any case 46:13 that's just a compact way of writing 46:15 what was out there okay all right so we 46:20 can figure out how many runs of length zero to 46:22 add and then you turn the greedy 46:25 algorithm loose and you end up with a tree 46:29 of degree K that has the minimum 46:31 weighted external path length 46:37 are there any questions how do you decide K okay 46:44 so it depends on the application okay so 46:47 if you're doing a k-way merge say 46:49 you're doing merge sort you need to know 46:51 what your order should be it would be a 46:55 function of things like how much memory 46:57 do you have because you will need some 46:58 number of buffers okay and we'll look at 47:02 buffer management next to figure out how 47:04 many buffers you need so if you're doing 47:05 a four-way merge do you need ten buffers 47:08 do you need twenty buffers do you need six 47:10 buffers so depending upon the amount of 47:14 memory you have the block size and the 47:18 number of buffers needed for any K you 47:20 can figure out what is the maximum K you 47:21 can use if you're able to change block 47:25 sizes you could draw curves based on 47:27 disk speed and so on and figure out
at 47:30 what K you get the best performance are 47:37 there other questions yeah 48:26 well if you take any number like 48:32 let's say you take six and you want to 48:38 know is it a multiple of eight if I add 48:43 two I can get there if I add ten I can 48:47 get there if I add eighteen so the amount you 48:51 need to add is either some small 48:57 amount or that plus K minus one or plus two times 48:59 K minus one and so on if you add one of 49:02 those you'll always get a multiple so 49:05 if you start with say 12 and you want to 49:11 make it a multiple of 5 if it's not 49:14 already a multiple of 5 well then you 49:16 need to only add 3 if you add 5 to it 49:19 you know you won't get a multiple 49:22 because you didn't start with a 49:23 multiple you'd go too far okay so the 49:32 multiples of K minus 1 are separated by 49:34 a distance of K minus 1 so if you're 49:37 not already at a multiple you have to 49:39 travel less than K minus 1 to get to the 49:41 next multiple all right any other questions 49:49 okay
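The counting argument in this segment can be put into a short sketch. This is my own Python (the lecture shows no code, and the function names here are made up for illustration): each K-way merge takes K runs out and puts 1 back, a net drop of K minus 1, so you land exactly on one run only when R plus Q minus 1 is a multiple of K minus 1; the sketch also checks that the two forms of the dummy-run count from the slide agree, relying on the fact that Python's % of a negative number is non-negative, which is exactly the mod intended by the compact form.

```python
def dummy_runs(r, k):
    """Runs of length 0 to add so that r + q - 1 is a multiple of k - 1."""
    rem = (r - 1) % (k - 1)
    return 0 if rem == 0 else (k - 1) - rem

def dummy_runs_compact(r, k):
    # compact form from the slide: (1 - r) mod (k - 1);
    # 1 - r is negative for r > 1, but Python's % of a negative
    # number is non-negative, matching the mod intended here
    return (1 - r) % (k - 1)

def merges_to_one(num_runs, k):
    """Greedy K-way merging: K runs out, 1 back in, a net drop of
    k - 1 per merge. Returns the number of merges needed to reach
    exactly one run, or None if we run short at the last step."""
    count, steps = num_runs, 0
    while count > 1:
        if count < k:
            return None          # fewer than K runs left: stuck
        count -= k - 1
        steps += 1
    return steps

# the two forms of the dummy-run count agree everywhere:
assert all(dummy_runs(r, k) == dummy_runs_compact(r, k)
           for r in range(1, 100) for k in range(2, 10))

# example: 12 runs, 6-way merge. 12 - 1 = 11 is not a multiple of 5,
# so we add 5 - (11 mod 5) = 4 dummy runs; 16 runs then reduce
# 16 -> 11 -> 6 -> 1 without ever running short:
q = dummy_runs(12, 6)
print(q)                          # → 4
print(merges_to_one(12 + q, 6))   # → 3
print(merges_to_one(12, 6))       # → None (runs short without dummies)
```

Since the added runs have length zero, feeding them to the greedy (Huffman-style) tree construction costs nothing extra; they only pad the count so every merge in the tree is a full K-way merge.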