Transcript

Okay, in the last lecture we saw how to generate runs whose size was larger than the size of memory and potentially could be very large; you could end up with a single run, independent of the number of records you were trying to sort, provided they were reasonably close to being in sorted order. Now, that method resulted in runs of variable length, and once you have variable-length runs, the simple strategy of just making multiple passes over the runs to merge them doesn't yield an optimal merge strategy. So today we want to look at how we might figure out the optimal way to merge a collection of runs of different lengths.

This is the example we saw last time. Basically, the tree here represented the multipass scheme, and its cost, as we saw last time, was 44; the other one represented a non-multipass scheme, and that had a better cost of 42.

All right, so to figure out the best way to merge a collection of runs, we'll introduce a term called the weighted external path length, which is a property of a binary tree. Then we'll see that the weighted external path length of a binary tree actually gives us the cost of using that binary tree for merging together a collection of runs, so what we want to do is minimize the weighted external path length of the merge tree being used.

Okay, so if you think back to the previous slide, the multipass merging was described by one binary tree, and then we had a different scheme described by the binary tree on the right. Really, any binary tree that you come up with describes a merging pattern, so our objective becomes one of finding the right binary tree, and the parameter that you want to
optimize on that binary tree is something called the weighted external path length.

All right, so here is how the weighted external path length of a binary tree is defined. You have a binary tree; wherever a node would otherwise have a missing child, you put an external node, and the external nodes have weights. You take the weight of an external node, multiply it by the distance of that node from the root of the tree, and add that up across all external nodes; we call that sum the weighted external path length.

So, for example, the four external nodes here are shown in white, and their weights are 4, 3, 6, and 9; think of those as the lengths of runs. The weighted external path length of this binary tree will be 4 times the distance of that node from the root, and this one is two hops from the root, so 4 times 2, plus 3 times 2, plus 6 times 2, plus 9 times 2. That works out to be 44, which was also the merge cost of this tree, and that's not just a coincidence. The cost of this internal node is 7 as far as merging goes, the cost of that node is 22, and the cost of the other one is 15; add up 7, 15, and 22 and you get 44. The reason that sum equals what you get the other way is that these 4 records out here are merged at this node and they're also merged again at that node: these four records are read in here, read in there, merged here, merged there, written out here, written out there, so 4 times 2 is their contribution to the total cost. Similarly, these three records are read in twice, written out twice, and merged twice, so you take 3 times 2. That's why, when you take the weight of each external node, multiply it by its distance from the root, and add
them up, you get the same thing as when we said over here there's a cost of 7, there's a cost of 15, and there's a merge cost of 22. So the weighted external path length is exactly equal to the merge cost.

Look at the other tree; the same thing holds: 4 times its distance from the root, and this one is three hops from the root, so 4 times 3, plus 3 times 3, plus 6 times 2, plus 9 times 1, and that works out to 42, again the merge cost of that tree. Again, that's not a coincidence, because these four records will be read in once here, again here, and again here, so three times, and they're written out three times and merged three times; these nine are read in only once, at this node, written out once, and merged once, so it's 9 times 1. All right, so does everybody see the connection between the weighted external path length and the merge cost of a binary tree? Also, every binary tree defines a merge pattern, so every binary tree that has four external nodes will define a possible way to merge four runs together. Any questions about that?

All right, so our objective becomes this: you are given the weights of the external nodes. You run the first phase of merge sort, the run-generation phase, and let's suppose it produces 83 runs. Following that, you know the lengths of the runs, so you're given the number of runs, 83, and you're given their weights, and you want to find a binary tree with minimum weighted external path length; it's got to have 83 external nodes, and the weights of those nodes are known to you. That's the problem we want to solve.

Okay, before looking at a solution to the problem, let's take a look at a couple of other applications for binary trees with minimum weighted external path length.
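Before moving on, the equivalence just described can be checked with a small Python sketch. The nested-tuple tree encoding here is my own, not from the lecture: a tree is either a leaf weight or a pair of subtrees.

```python
# Weighted external path length (WEPL) and merge cost of a binary tree.
# A tree is either a leaf weight (int) or a pair (left, right).

def wepl(tree, depth=0):
    """Sum of leaf weight * distance from root, over all external nodes."""
    if isinstance(tree, int):                  # external node
        return tree * depth
    left, right = tree                         # internal node
    return wepl(left, depth + 1) + wepl(right, depth + 1)

def merge_cost(tree):
    """Total merge cost = sum of internal-node weights (sum of leaves below)."""
    def total(t):                              # subtree weight = sum of its leaves
        return t if isinstance(t, int) else total(t[0]) + total(t[1])
    if isinstance(tree, int):
        return 0
    return total(tree) + merge_cost(tree[0]) + merge_cost(tree[1])

# The two merge trees from the slide, with run lengths 4, 3, 6, 9:
multipass = ((4, 3), (6, 9))      # balanced tree
skewed = (((4, 3), 6), 9)         # skewed tree

print(wepl(multipass), merge_cost(multipass))  # 44 44
print(wepl(skewed), merge_cost(skewed))        # 42 42
```

The two numbers agree on both trees, which is exactly the equivalence the lecture argues for.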
One application we'll look at is message coding and decoding, and the other is lossless data compression.

All right, so suppose you want to send messages. Here I've got a collection of n messages, labeled m0, m1, and so on. The messages are known ahead of time; these are standard boilerplate messages, and they don't change over time. So somebody comes to you and says, send message 3 to my friend out there, or send message 6 and message 8, and at the receiving station they know the content of a message just from knowing its index: if I know message 8 came, I know what it is, and I can print out the full text. So the messages are known, they don't change over time, and the transmitter and receiver, given the index, know what the message is. So you don't really transmit the message, which may be long, maybe several kilobytes or megabytes; instead you transmit a code, which is short, and then for the user at the other end the full message can be printed out.

All right, so our objective is to come up with the codes that should be used, and to come up with the codes we assume that we know the frequency with which each message is going to be sent. Some messages are sent very infrequently, some are sent very frequently, and we know the frequency, or relative frequency, with which messages will be sent, and we want to develop the optimal codes to use. Optimal here means you want to optimize the transmission cost and you want to optimize the decoding cost.

All right, suppose you have four messages and the frequencies are 2, 4, 8, and 100. Now, you could use two-bit codes, so that each message has a code of the same length; we get four codes: 00, 01, 10, and 11. If you transmit 00, the receiver knows it was the first message; if you transmit
11, the receiver knows it's the last message.

The transmission cost is the length of the code times the frequency of the message, or, the way I've got it here, frequency times length. So the first message is sent with a frequency of 2 and its length is 2, so it contributes 4; the second one is sent 4 times with a length of 2, so that's 8 bits of transmission; the last one is sent 100 times at a cost of 2 bits per transmission. The total transmission cost works out to be 228 if you use this code.

Okay, to do the decoding at the receiving end we'll make use of a binary decode tree, which looks like this. You look at the first bit in the code: if it's a zero you move to this side, and if it's a one you move to that side. Then you look at the second bit in the code: if it's a zero you come here, otherwise, if it's a one, you go over there. So with 00 you get m0, with 01 you get m1, with 10 you get m2, and with 11 you get m3.

All right, so the cost of decoding message m0 is 2, because I've got to follow two branches in my decode tree; to decode m1 is also a cost of 2, since you've got to follow two branches, and the same for each of the others. So the decoding cost is the frequency times the distance from the root. The decoding cost will always be equal to the transmission cost, and it will always be equal to the weighted external path length of the decode tree. So to find optimal codes you need to find an optimal decode tree, by which I mean a binary tree whose weighted external path length is minimum. All right, so our problem of merging runs of variable lengths optimally is identical to the problem of finding optimal codes. Any questions about that?
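The cost computation above can be sketched in a couple of lines. Since a code's length equals its leaf's depth in the decode tree, this sum is also the weighted external path length of that tree (a sketch; names are my own):

```python
# Transmission cost (= decoding cost) of a code:
# sum over messages of frequency * code length.

def transmission_cost(freqs, codes):
    return sum(f * len(c) for f, c in zip(freqs, codes))

freqs = [2, 4, 8, 100]
fixed = ["00", "01", "10", "11"]        # equal-length 2-bit codes
print(transmission_cost(freqs, fixed))  # 228
```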
An observation here: just like every binary tree with n external nodes could be used to merge n runs, every binary tree with n external nodes defines a code pattern for n messages. You just label the edges 0 and 1, depending on whether each is a left edge or a right edge, and you read the labels on the way down; that gives you the code to use. The decoding cost is again the same kind of thing: you take the weight of the node, that is, the frequency, times its distance from the root, and that's its contribution to the total decoding cost. In this case it works out to 144, the same as the transmission cost, the weighted external path length.

All right, let's look at another example. This time I've got ten messages, m0 through m9. Again, I can figure out the codes by just following the path from the root to any message and reading off the labels: 011 is the code for m5, and 0010 would be the code for m2. Just follow the path from the root to the external node, and that gives you its code.

Now, the reason that every binary tree defines a usable code is that when you use a tree, no path to an external node is a prefix of another path to an external node. So, for example, we have a path out here which is 1 1, but there's no other path from the root to an external node which is 1 1 followed by something. That property is important in decoding, because when the messages are sent you're sending long strings of binary bits: when you send m9 you send 1 1, but then immediately after that you might send m0, so you'll send 0 0 0 and another 0, and there's no marker in between those two messages. So by looking at 1 1, you follow the decode tree, you do a 1 1, you end up out here, and you know it's m9, and
then 15:17 you know that the next bit defines the 15:18 start of the next message and you can 15:24 make that conclusion only if no code is 15:26 a prefix of another prefix 15:30 so when you're going to variable length 15:32 prefixes in order for things to work and 15:35 you're sending streams of long streams 15:38 of codes one after the other no message 15:41 or no proof no code should be a prefix 15:43 of another code and using trees to 15:47 define the codes guarantees that we'll 15:53 see that when we look at the lossless 15:55 data compression example how we make use 15:57 of that property 15:59 all right so in lossless data 16:01 compression you may for example have a 16:04 long piece of text it says composed of 16:07 characters and you want to read write 16:12 that piece of text using a smaller 16:16 number of bits so for example suppose we 16:22 have text that's made up of four 16:24 characters only a B C and D and let's 16:33 suppose you have a string that has ten 16:36 A's and at five B's in it 100 CS and 16:39 it's got 900 DS 16:47 now I could write my string since there 16:50 are four characters I could say well I 16:52 can get by with two-bit codes that will 16:55 give me four different codes so use 16:57 codes of the same length and when codes 17:02 are the same length it doesn't matter 17:03 which code is assigned to which 17:05 character the frequency doesn't play a 17:07 role because all their lengths are the 17:08 same okay so let's say a is represented 17:13 by 0 0 B by 0 1 C by 1 0 and D by 1 1 17:16 then the length of the string using this 17:20 coding becomes there are 10 A's serves 17:24 10 times 2 5 B's 5 times 2 100 C's 100 17:28 times - 900 DS 900 times 2 so it takes 17:33 you 2030 bits to represent this strength 17:38 plus of course you need to have a code 17:41 table which Maps the codes to characters 17:45 okay so you have a code table with 4 17:47 entries ABCD okay Oh perhaps in this 17:52 case it could even be implicit right we 17:59 
Now suppose instead you use a variable-length code with the property that no code is a prefix of another. The code I'm going to use is: a is represented by 3 bits, 000; b is also 3 bits, 001; c is a 2-bit code, 01; and d is a 1-bit code, 1. Here you can see that no code is a prefix of another, and this code actually came from a binary tree.

All right, so the size of the string is going to be: there were 10 a's, 5 b's, 100 c's, and 900 d's, so 10 times 3, plus 5 times 3, plus 100 times 2, plus 900 times 1. That works out to 1,145 bits, plus the size of the code table.

Okay, so ignoring the size of the code table, the difference between the two schemes affords us a compression factor of about 1.8; using about half the space, I can represent this string. Now, this of course is a useful representation only if from this binary string you can recover the string you started with. You could take any input string and represent it with zero bits, just throw it away, but then you can't recover what you started with, and you have to be able to recover what you started with to have a reasonable compression scheme. So you have to be able to decompress and get back what you started with, and for that I'm going to use this decode tree.

[Student question.] Right, the 1.8 is really a function of the frequencies of the characters. And so, to come up with the proper codes, do you have to examine the data? That's right, the code will be data dependent. If you look at lossless compression schemes, the most common one would be something like the LZW scheme, where the codes are developed on the fly: you examine the string left to right and you develop the coding. It's very data dependent.
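The arithmetic for the two schemes can be checked with a short sketch (variable names are my own):

```python
# Encoded size of the example string under both codes.
freq = {"a": 10, "b": 5, "c": 100, "d": 900}

fixed    = {"a": "00",  "b": "01",  "c": "10", "d": "11"}   # 2 bits each
variable = {"a": "000", "b": "001", "c": "01", "d": "1"}    # prefix-free

def bits(code):
    """Total bits to encode the string: frequency * code length per character."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(bits(fixed))                             # 2030
print(bits(variable))                          # 1145
print(round(bits(fixed) / bits(variable), 2))  # 1.77, roughly the 1.8 factor
```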
So you're right, the code is data dependent, and the only reason for representing d by one bit was that it occurs so many more times than any other character.

All right, so to do the decoding, you build a binary tree based on the code: if you see 000 it's an a, if you see 001 it's a b, if you see 01 it's a c, and if you see a 1 it's a d. So let's suppose you're given this string. You started off with a's, b's, c's, and d's, then you compressed it and ended up with a binary string, and now you have to recover the original string. Okay, so I'll go through this and walk down my decode tree. I have a 0, 0, 0, so I end up here; these first three zeros give me an a. Then I pop back up to the root; I see a 1, so that gets me a d. Then I see another 1, and that gets me a d. Then I see a zero, so I come here; I see another zero, I come here; a 1 gets me to another external node, so that gets me a b. Then I see a zero and then a one, and I get a c. So every time you reach an external node you get a character, and then you pop back up to the root. There are no boundaries in the binary string between the different codes, and the fact that no code is a prefix of another ensures that you can do the reverse transformation.

Okay, so to maximize the compression, given a string, we'd like to have a binary tree with minimum weighted external path length, and based upon that we will develop the codes for the different characters in the text that we are compressing. Any questions about lossless data compression using this strategy?
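The decoding walk just described can be sketched as follows. The dict-of-dicts tree encoding is my own choice; the lecture only assumes some tree structure:

```python
# Decoding a prefix-free bit string by walking a decode tree.
# Internal nodes are dicts keyed by '0'/'1'; external nodes are characters.

def build_decode_tree(codes):
    root = {}
    for ch, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = ch               # external node holds the character
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):         # reached an external node
            out.append(node)
            node = root                   # pop back up to the root
    return "".join(out)

codes = {"a": "000", "b": "001", "c": "01", "d": "1"}
# The walkthrough from the lecture: a, d, d, b, c.
print(decode("0001100101", build_decode_tree(codes)))  # addbc
```

Note there are no separators in the input bit string; the prefix-free property alone is what makes each reset to the root land at a message boundary.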
[Student question.] The question was about the difference between how you develop the code and how you implement the actual compression and decompression. Okay, so what I'm trying to get across here is that you first have to have that table which maps from characters to codes, and how do you get the codes? The way you get the optimal codes is by thinking in terms of a minimum weighted external path length tree. Now, whether you actually implement the decompression using a binary tree or some other data structure is a different story.

[Student question.] Yes, it's not limited to binary codes; you can have ternary codes, or codes of any degree. You would again be looking for minimum weighted external path length trees.

[Student question.] Is the standard shape of binary trees most efficient? What do you mean by the standard shape? Okay, a balanced binary tree. All right, a balanced binary tree for this particular example would be bad, because the frequencies are so lopsided, but if all the frequencies were the same, then the balanced binary tree would be the best way to go. So it really depends on the frequencies with which the different letters appear, or on the weights of the runs, as we saw with merging runs: if all the runs are the same length, then you use a merge tree which looks like a full binary tree; it's only when the run lengths are very different that you end up using a skewed binary tree. Is that what you were looking for? Okay. Other questions?

All right, so trees that have minimum weighted external path length are called Huffman trees, and the corresponding codes are called Huffman codes. You've probably heard of Huffman codes; they are extensively used in a variety of applications, and we just looked at three of them here.

All right, so again we want minimum weighted external path length, and Huffman trees don't
have to be binary, they can be of any degree. If you're trying to find binary trees with minimum weighted external path length, you can do that using a greedy algorithm, which we'll look at in a moment. When you want to go to higher-degree trees, you need a preprocessing step, and then you run the greedy algorithm after that; that gets you the tree with the best weighted external path length.

Okay, let's look at the greedy algorithm for binary trees. Basically, if you're thinking about merging, what we're trying to do is this: we have a bunch of runs, and I want to merge the runs together two at a time. So if I gave you some number of runs, say six, and I said merge them together two at a time, and you do it in a greedy fashion, most likely what you would do is first pick the two runs of minimum length out of those six and merge them together; that gives you a new run, which you put back in your collection, so now you've got five runs, and you repeat this process until you're left with one run. Each time you have to decide which two runs to merge, you make the decision in a greedy fashion: find the two runs of smallest length, because merging them will cost you the least, merge them, and put the result back in the collection of runs.

All right, so that's the greedy algorithm. We want to build a tree, so we start with as many trees as you have runs, or weights; each tree has only an external node. That represents the initial configuration of standalone runs. Then we're going to reduce the number of trees in our collection by one by performing a greedy step, and in this greedy step we select the two trees that have the smallest weights and combine them together. So perhaps you do a merge, or, if you're building the tree, that means you take an
internal node and make those two trees subtrees of that internal node. We give the newly created tree a weight that's the sum of the weights of the two trees that were combined, and we put the new tree back in the collection of trees. So this reduces the number of trees by one: pick two trees of minimum weight, combine them into one tree by adding a new internal node, make the two trees subtrees of it, set the weight of the new tree to the sum of the weights of the two trees you just combined, and put it back in the collection. This reduces the number of trees by one, and then you repeat this reduction step until you're down to one tree.

Let's see an example. Say we have five weights; maybe you've got five runs, and the lengths are 2, 5, 4, 7, and 9. So I start with a collection of five trees; each tree has an external node only, and the weight of each tree is one of the weights you started with. In the first reduction step, I pick two trees of minimum weight; those will be the 2 and the 4. Take them out of the pool and combine them: you get a new internal node, you give it a weight which is the sum of the weights of the two trees, and you put it back in the collection. So I'm down from five trees to four. Then I pick two trees of minimum weight; those will be the 5 and the 6. Take them out and combine them: the 5 and the 6 are combined under a new node, you get a new tree whose weight is 11, and you put it back in the collection. Then you pull out two trees of minimum weight, this time the 7 and the 9; combine them and put the result back in the collection, so the new weight is 16. Then you pull out two trees of minimum weight; there are only two left, the 11 and the 16. Combine them, and you get a tree
of weight 27.

At this point you're done. You have a binary tree whose external nodes have the given weights; the weight of each internal node corresponds to the cost of the merge at that node, and the claim is that this binary tree has the minimum weighted external path length amongst all binary trees possible for the given set of weights.

Okay, any questions about the algorithm? [Student question.] So, given these weights, you mean the time required to find the one with the minimum weight? We'll look at the implementation of the algorithm in a moment, so let's not worry right now about how much time it takes to run this thing. At this point the question is really, at the concept level, do you understand how the algorithm works?

All right, once you come up with an algorithm such as this, there are two questions you need to address. One is: does this algorithm really work? Does it really get you binary trees that have minimum weighted external path length? At this point you have no reason to believe that it does; there's nothing in the construction to suggest that it really does. So you have to prove that it works. We're not going to do the proof in class, but basically you can prove this by induction on the number of weights that you start with, showing that the optimal tree for these five weights will always have a subtree with the 2-4-6 configuration in it; that this has to be a configuration in every optimal tree for the given five weights. So you can prove the whole thing by induction, where in the induction step you simply need to prove that the two trees that have minimum weight at that time must be combined together; otherwise the whole tree cannot be optimal.
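The greedy construction can be sketched with a heap. This version computes only the total merge cost, which equals the weighted external path length (a sketch of the algorithm just described, not the lecture's own code):

```python
import heapq

# Greedy two-way (binary Huffman) merging: repeatedly combine the two
# smallest weights; the sum of all combined weights is the minimum
# weighted external path length.

def huffman_merge_cost(weights):
    heap = list(weights)
    heapq.heapify(heap)                  # linear-time initialization
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)          # two delete-mins
        b = heapq.heappop(heap)
        total += a + b                   # cost of this merge / internal node
        heapq.heappush(heap, a + b)      # insert the combined tree's weight
    return total

# The lecture's example: intermediate weights 6, 11, 16, 27.
print(huffman_merge_cost([2, 5, 4, 7, 9]))  # 60
```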
Now, once you establish that the algorithm is correct, you move to the next step, which is: what data structures do you need to implement it, how much time is it going to take, and is that a reasonable, acceptable amount of time? So let's go to the next step: how much time does this thing take?

To figure out how to implement it: we have a collection of weights, and I need some data structure to represent this collection, and the right data structure to use depends on what operations will be performed on it. At the start of the algorithm I need to initialize this data structure to have n entries in it. Then, in the reduction step, I need to extract two trees with minimum weight, so that's two delete-min operations on this collection. Then, when I combine, I add the two weights and the new tree gets put back in, so I need to put a new weight into the structure; that's an insert. So I need to initialize, I need to do delete-mins, and I need to do inserts, and a good structure for that is a min heap. With a min heap you can initialize in linear time, and each delete-min takes log n time. The number of rounds to bring the collection down to one tree is n minus 1, so there are about 2(n - 1) delete-min operations; that's O(n log n). The number of inserts is n minus 1, and each insert takes log n time, so that's n log n for the inserts. Adding up everything, it's n log n, so it's a fairly efficient algorithm.

[Student question.] Yes, you can make it run faster by a constant factor: instead of doing two removes and an insert, you do one remove, and for the second one you do an increase-key operation.
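That optimization maps nicely onto Python's heapq, where heapreplace pops the minimum and sifts a new key down in a single pass (a sketch; the lecture describes the idea, not this code):

```python
import heapq

# Constant-factor speedup: replace the second delete-min plus the insert
# with a single change-key at the root. heapq.heapreplace does exactly
# that: it pops the minimum and pushes a new item in one restructuring.

def huffman_merge_cost_fast(weights):
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)              # one delete-min
        combined = a + heap[0]               # a + current second-smallest
        total += combined
        heapq.heapreplace(heap, combined)    # change the root's key, sift down
    return total

print(huffman_merge_cost_fast([2, 5, 4, 7, 9]))  # 60
```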
Once you do one remove, the second minimum item pops up to the root; you increase it by the weight of the first one you removed, and then you restructure the heap, and that should improve the performance by some small factor. So instead of doing a second delete and an insert, you do a change-key operation: let's say the first one you took out had a weight of 2 and the second one had a weight of 4; instead of taking the 4 out, I change the 4 to a 6 and restructure the heap from the root going down. But whichever way you go, asymptotically it's n log n. Any questions about the implementation?

Okay, next, suppose you are interested in higher-order Huffman trees; maybe you want to do a sixteen-way merge, or a twenty-way merge, or a sixty-way merge. Then you want to find a tree with a much higher degree that has minimum weighted external path length. If you simply run the greedy algorithm, it doesn't work. The greedy algorithm extended to higher degree would be: in the reduction step, if you're doing a three-way merge, you pick the three runs with minimum length, merge them, put the result back in, and repeat this process until you get down to one run. So try it out here, where I have four runs, 3, 6, 1, and 9. In the first step I pick the three of minimum length, so that's the 3, the 6, and the 1; their combined weight is 10, and I put it back in my collection. Now I've got a 10 and a 9; I try to pull out three, but there are only two, so I pull out what's left, just the 10 and the 9, and that last merge has a cost of 19. That gives me a total cost of 29.
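The failing naive extension can be sketched like this (the function name is my own):

```python
import heapq

# Naive extension of the greedy algorithm to k-way merging: always take
# the k smallest runs, or whatever is left. On the lecture's example it
# is not optimal.

def naive_kway_cost(weights, k):
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
        total += sum(group)
        heapq.heappush(heap, sum(group))
    return total

# Merges (1+3+6)=10, then only two runs remain: (10+9)=19.
print(naive_kway_cost([3, 6, 1, 9], 3))  # 29
```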
But if I'd done it this way, doing a two-way merge of the 3 and the 1 out here first, followed by a three-way merge of the resulting 4 with the 6 and the 9, that has a total cost of 23. So the greedy scheme didn't really work here.

You can probably figure out why it didn't work. The reason is that we weren't able to really do a three-way merge at that last step; we ended up doing a two-way merge, and a two-way merge can be viewed as a three-way merge which involves a run whose length is zero. So really, when you do a lower-order merge somewhere, it's like doing a higher-order merge with some number of empty runs. When you view it this way, you suddenly see that this is not what the greedy algorithm would have done had it known you had a run of length zero: if you had known you had a run of length zero, you'd have first merged the 0, the 1, and the 3, so those would be the ones sitting at the bottom, and if you do that, then you can prove that the greedy algorithm works. So the problem is that you started off with an incorrect number of runs; you need to put in a bunch of zero-length runs and then run the greedy algorithm.

So then you need to figure out how many runs of length zero to start with. Basically, you want to start with a total number of runs such that every time you run the reduction step, if you're doing a three-way merge, you have three runs out there, and in the end it goes from three to one. Or if you're doing a ten-way merge, there should always be at least ten runs: you pull out ten, you merge, you put one back, and at the very end you pull out ten, put one back, and there's only one left. You should never find yourself doing a low-order merge.
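Padding with zero-length runs can be sketched as below; the dummy-run count uses the formula the lecture derives next, and the code itself is my own sketch:

```python
import heapq

# Pad with zero-length runs so every step is a full k-way merge. The
# padding count q is the smallest q >= 0 making r + q - 1 a multiple of
# k - 1; in Python that is (1 - r) % (k - 1).

def kway_huffman_cost(weights, k):
    r = len(weights)
    q = (1 - r) % (k - 1)                 # number of dummy (length-0) runs
    heap = list(weights) + [0] * q
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(k)]   # always a full k
        total += sum(group)
        heapq.heappush(heap, sum(group))
    return total

# With one dummy run: merges (0+1+3)=4, then (4+6+9)=19.
print(kway_huffman_cost([3, 6, 1, 9], 3))  # 23
```

Note that `%` on a negative operand in Python returns a non-negative result when the modulus is positive, which is exactly what the compact form of the formula needs.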
All right, so let's say you're doing a k-way merge; you're trying to build a k-ary tree. Suppose that you start with a total of r runs. Then we're going to add some number of runs; call the number of runs we're going to add q, and that's a non-negative number: it may be zero, or it could be more. Each time we do a k-way merge, the number of runs decreases by k minus 1; if you do a six-way merge, you pull out six runs and put one back in, so the reduction is by five. Now suppose we repeat this process s times; then the total reduction in the number of runs is s times (k minus 1). I started with r runs, I added q of length zero, and I ran my reduction step s times, so I'm left with r + q, which is what I started with, minus s(k - 1), and I want that number to be 1 for some integer value of s.

All right, so I want r + q - s(k - 1) = 1 for some integer value of s; that's the same as saying that r + q - 1 is some multiple of k - 1.

So the first thing we conclude from this is that the number of runs you need to add is never more than k - 2: starting with an integer number of runs, you have to add at most k - 2 zeros to get to the next multiple of k - 1. If r - 1 is already a multiple of k - 1, then you don't have to add anything, so in that case q is 0. And if r - 1 is not a multiple of k - 1, then you take the remainder of r - 1 divided by k - 1, and I need to add enough to bring it up to k
minus one so 45:07 I do K minus one minus that remainder so 45:12 K minus one minus the remainder of R 45:15 minus one divided by K minus one that's the 45:18 number of runs that you need to add okay 45:25 all right so the way you determine the 45:27 number of runs to add is you take K 45:29 minus one and subtract from that the 45:31 remainder of R minus one divided by K 45:33 minus one if you add that many runs of 45:36 length zero and then you run the greedy 45:39 algorithm you will never run short of 45:41 runs until you get down to the last step 45:43 where you end up with one run now that formula is 45:49 mathematically equivalent to writing it 45:51 this way the number you need to add 45:56 is one minus R mod K minus one now one minus R 46:00 will be a negative number since R is 46:02 bigger than one and so you'd be taking the 46:08 mod of a negative number in any case 46:13 that's just a compact way of writing 46:15 what was out there okay all right so we 46:20 can figure out how many runs of length zero to 46:22 add and then you turn the greedy 46:25 algorithm loose and you end up with a tree 46:29 of degree K that has the minimum 46:31 weighted external path length 46:37 are there any questions how do you decide K okay 46:44 so it depends on the application okay so 46:47 if you're doing a k-way merge say 46:49 you're doing merge sort you need to know 46:51 what your order should be it would be a 46:55 function of things like how much memory 46:57 do you have because you will need some 46:58 number of buffers okay and we'll look at 47:02 buffer management next to figure out how 47:04 many buffers you need so if you're doing 47:05 a four-way merge do you need ten buffers 47:08 do you need twenty buffers do you need six 47:10 buffers so depending upon the amount of 47:14 memory you have the block size and the 47:18 number of buffers needed for any K you 47:20 can figure out what is the maximum K you 47:21 can use if you're able to change block 47:25 sizes you could draw curves based on 47:27 disk speed and so on and figure out
at 47:30 what K you get the best performance are 47:37 there other questions yeah 48:26 well if you take any number like 48:32 let's say you take six and you want to 48:38 know is it a multiple of eight if I add 48:43 two I can get there if I add ten I can 48:47 get there if I add eighteen so the amount you 48:51 need to add is either some small 48:57 amount or that plus K minus one or plus two times 48:59 K minus one and so on if you add one of 49:02 those you'll always get a multiple so 49:05 if you start with say 12 and you want to 49:11 make it a multiple of 5 if it's not 49:14 already a multiple of 5 well then you 49:16 need to only add 3 if you add 5 to it 49:19 you know you won't get a multiple 49:22 because you didn't start with a 49:23 multiple you'd go too far okay so the 49:32 multiples of K minus 1 are separated by 49:34 a distance of K minus 1 so if you're 49:37 not already at a multiple you have to 49:39 travel less than K minus 1 to get to the 49:41 next multiple all right any other questions 49:49 okay
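The counting argument in this segment can be put into a short sketch. This is my own Python (the lecture shows no code, and the function names here are made up for illustration): each K-way merge takes K runs out and puts 1 back, a net drop of K minus 1, so you land exactly on one run only when R plus Q minus 1 is a multiple of K minus 1; the sketch also checks that the two forms of the dummy-run count from the slide agree, relying on the fact that Python's % of a negative number is non-negative, which is exactly the mod intended by the compact form.

```python
def dummy_runs(r, k):
    """Runs of length 0 to add so that r + q - 1 is a multiple of k - 1."""
    rem = (r - 1) % (k - 1)
    return 0 if rem == 0 else (k - 1) - rem

def dummy_runs_compact(r, k):
    # compact form from the slide: (1 - r) mod (k - 1);
    # 1 - r is negative for r > 1, but Python's % of a negative
    # number is non-negative, matching the mod intended here
    return (1 - r) % (k - 1)

def merges_to_one(num_runs, k):
    """Greedy K-way merging: K runs out, 1 back in, a net drop of
    k - 1 per merge. Returns the number of merges needed to reach
    exactly one run, or None if we run short at the last step."""
    count, steps = num_runs, 0
    while count > 1:
        if count < k:
            return None          # fewer than K runs left: stuck
        count -= k - 1
        steps += 1
    return steps

# the two forms of the dummy-run count agree everywhere:
assert all(dummy_runs(r, k) == dummy_runs_compact(r, k)
           for r in range(1, 100) for k in range(2, 10))

# example: 12 runs, 6-way merge. 12 - 1 = 11 is not a multiple of 5,
# so we add 5 - (11 mod 5) = 4 dummy runs; 16 runs then reduce
# 16 -> 11 -> 6 -> 1 without ever running short:
q = dummy_runs(12, 6)
print(q)                          # → 4
print(merges_to_one(12 + q, 6))   # → 3
print(merges_to_one(12, 6))       # → None (runs short without dummies)
```

Since the added runs have length zero, feeding them to the greedy (Huffman-style) tree construction costs nothing extra; they only pad the count so every merge in the tree is a full K-way merge.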