Transcript

Okay, all right, any questions before we start today's lecture? All right, so we're going to continue with our discussion of external sort, and we're still looking at enhancing merge sort. The part that remains is the second phase of merge sort, in which we merge runs together. We've seen that when we use the loser-tree method to generate runs, we end up with runs of different lengths, and we then use the Huffman-tree construction to figure out an optimal order in which these runs should be merged. Now we need to look at how we actually take, say, k runs, if we're doing a k-way merge, and merge them together into one run.

This is a slide we had up when we looked at the initial version of merge sort. We said that to improve the second phase we wanted to reduce the number of merge passes. At that time we were dealing with runs of the same size, so we thought about using a higher-order merge, and that translated into a number of passes given by log base k of the number of runs. Of course, now we're not going to make passes; instead we're going to use a merge tree as determined by a Huffman tree. Still, using a higher-order merge will reduce the total cost of the merge, so we'll use a Huffman tree of order k.

We of course need to select k, but once k has been selected, we want to be able to overlap the input/output with the internal merging that takes place. For that to happen we need two disks, as was the case for the similar overlapping during run generation, and we also need to organize our memory so that it's possible to perform three tasks in parallel.
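As a quick illustration of the pass count just mentioned (my own sketch, not from the lecture): with a k-way merge over equal-length runs, each pass cuts the number of runs by a factor of k, so the number of passes is the ceiling of log base k of the number of runs.

```python
def merge_passes(num_runs: int, k: int) -> int:
    """Number of passes a k-way merge makes over num_runs equal-length runs.

    Each pass merges groups of k runs into one, replacing the current runs
    by ceil(num_runs / k) longer runs, so this loop computes
    ceil(log_k(num_runs)) without floating-point logarithms.
    """
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // k)  # ceiling division
        passes += 1
    return passes

# e.g. 64 runs with an 8-way merge take 2 passes; with a 2-way merge, 6 passes
```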
We want to be able to read from one disk, and while we are reading, we want to be able to write out parts of runs that have been generated, and we want to be able to perform the internal merging that generates the run. So we want to do these three things in parallel. This is a diagram we've used in a previous lecture: down here is the synchronization step, where when one of these three activities finishes, you wait for the other two to finish, and then you proceed to the next cycle.

So, in order to do input, output, and merging in parallel, we're going to need two output buffers: one in which you generate the next block of the run, and another that holds the current block being written out. Call them output buffer 0 and output buffer 1. We might, for example, be filling up this buffer using the CPU, doing internal merging, while that one is being written out to disk, and then we interchange the two: we fill this one while writing that one out.

We also need input buffers. If you're going to do a k-way merge, you need at least one buffer load from each of the k runs, so that's k input buffers, and then another buffer into which you're reading the next buffer load, so you can keep the read active. So you need at least k + 1 input buffers. But k + 1 is really not enough: if you have just k + 1, you can construct an example showing that no matter how you fill the extra buffer, you can end up in a situation where you don't have enough data to proceed with the merge, so you have to wait for input.

It turns out that you really need at least 2k buffers to get things to work, and even for 2k to work you have to be careful about how you use them. If you take the 2k buffers and allocate two buffers per run — a static allocation — then the text has an example showing that you will still run out of input from some run and have to wait for that run to be replenished before the merge can continue. You really need a dynamic allocation of buffers, where you do some kind of prediction: based on the data currently in memory, you expect to run out of data in, say, run three first, so you bring in the next buffer load for run three. When it's time to read again, you look at the last record read from each of your runs and see which run is going to run out first — that's the one whose last record read has the smallest key — and you bring in the next buffer load from that run. So you predict which run is going to exhaust first, and whenever it's time to read another buffer load, you read from that run.

All right, so we're going to use a predictive strategy, and with it we'll be able to get by with 2k buffers. The reason we want to minimize the number of buffers, of course, is that the order of merge you can use is constrained by the size of a block — which is the same as the size of a buffer — and by the size of the memory you have. To use the highest possible order of merge, you need to use the smallest possible number of buffers. So: a static two-input-buffers-per-run scheme doesn't work, 2k buffers allocated statically are not enough, and we will instead allocate buffers dynamically, on an as-needed basis.
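The predictive choice just described — pick the run whose last-read key is smallest, with a fixed tie-breaker — can be sketched like this (my own illustration; the lecture gives no code):

```python
def run_to_exhaust_first(last_keys):
    """Index of the run predicted to exhaust first.

    last_keys[i] is the key of the last record read so far from run i.
    The run with the smallest such key must run out of in-memory data
    first; ties are broken by run index, which serves as the enforceable
    tie-breaker that the k-way merge itself must also use.
    """
    return min(range(len(last_keys)), key=lambda i: (last_keys[i], i))

# e.g. with last keys [50, 30, 70, 30], runs 1 and 3 tie; run 1 is chosen
```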
So it will be possible that at one time we have, say, k + 1 buffers assigned to one run while the remaining runs have only one buffer each; at some other time you may have two per run; at some other time some other distribution of buffers. Anything is possible. As I said, we determine which run is going to exhaust first by looking at the last record that has been read from each run: the run whose last record read has the smallest key is the one that will exhaust first. In case of a tie, we need an enforceable tie-breaker, which is also used by our k-way merge. So we can always predict which run will exhaust first, and we read from that run when the next opportunity to read arises.

To manage our buffers we employ queues. Say you're doing a nine-way merge, so k is nine; then I have nine queues of buffers, for run 0, run 1, run 2, up through run 8. Each queue can hold some number of buffers, and each has a front-of-queue pointer and a rear-of-queue pointer. In this particular configuration, run 0 has three buffer loads of data in memory, run 1 has one buffer load, run 2 has one buffer load, run 4 has two buffer loads, and run 6 has three buffer loads. The number of buffers assigned to a run varies over time, depending on need. When it's time to read a buffer load, we look at the last record in the rear buffer of each queue, see which has the smallest last key, and that is the run for which we read the next buffer load; when the read finishes, that buffer load gets attached to the end of that run's queue.

We have two output buffers. If this is the current active output buffer, then we merge from the fronts of these queues into the current active output buffer. So that's the setup: we have k queues of buffers, and some number of input buffers that are currently not in use sit in a pool of free buffers. When it's time to read a block, we take a free buffer from the pool and start reading into it; when the read finishes, that buffer is attached to the end of the appropriate queue. When you're merging, you merge from the fronts of the queues; when you finish using a buffer, you move to the next buffer in its queue. If the output buffer becomes full, you stop merging and go into a synchronization step.

To start the system, we initialize k queues of input buffers, each queue holding exactly one buffer: we read in one block from each of the k runs that are to be merged. We have a total of 2k input buffers, so of the remaining k buffers, k − 1 go into the pool of unused buffers. I set my active output buffer to be output buffer 0. That leaves one input buffer that hasn't been read into, and I initiate a non-blocking read into it from the run that is predicted to exhaust first. That's the initialization step.
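The state right after initialization might look like this (my own Python sketch; buffer loads are modeled as lists of records, runs are assumed non-empty, and the non-blocking disk read is not simulated — only the bookkeeping is):

```python
from collections import deque

def initialize(runs, block_size):
    """Build the starting merge state: k queues holding one buffer load per
    run, k - 1 free input buffers, and two output buffers.

    `runs` is a list of k sorted, non-empty lists standing in for runs on
    disk; the free pool is modeled simply as a count of unused buffers.
    """
    k = len(runs)
    positions = [block_size] * k                           # next unread offset per run
    queues = [deque([run[:block_size]]) for run in runs]   # one buffer load per run
    free_buffers = k - 1                                   # of the remaining k input buffers
    out_active, out_writing = [], None                     # output buffer 0 active, 1 idle
    # the one leftover input buffer gets a non-blocking read started into it,
    # from the run whose last key read so far is smallest (exhausts first)
    target = min(range(k), key=lambda i: (queues[i][-1][-1], i))
    return queues, positions, free_buffers, out_active, out_writing, target
```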
Then I'm going to have a method called kWayMerge, which merges from the queues of buffers into the current active output buffer. This method stops merging either when the output buffer becomes full, or when the end-of-run marker is merged into the output buffer. Assume each run ends with an end-of-run marker, say a very large key value; when that marker is merged into the output buffer, it signals the end of the merging of these k runs. So kWayMerge stops when one of two things happens: the output buffer becomes full, or you're finished merging the k runs. If neither event has occurred and an input buffer becomes empty, you advance to the next buffer in that queue and continue merging from there; when you advance, the empty buffer is put into the pool of free buffers. Questions about how the k-way merge works?

All right, so now, to merge k runs: after the initialization I have a loop that does a kWayMerge — which stops either when the output buffer is full or when the merging of the k runs is finished — followed by a synchronization step: wait for any active input or output to complete. The first time through, you only have an active input, because as part of the initialization I started a non-blocking read; in subsequent rounds both input and output will be active. After the synchronization, if an input was taking place, you have a full input buffer, and you attach it to the end of the queue for its run. You determine which run will exhaust first by looking at the last keys in all the queues. If there is more input for that run — it may be that you've already finished reading all the blocks of all k runs — you initiate a read of its next block; so long as there is more input, you initiate a read. Then I initiate a write of the active output buffer. These are non-blocking reads and writes. I switch the roles of the two output buffers, and then I repeat, and you keep doing this until you're finished merging the k runs.

To recap: first the initialization step — read in the first block from each of the k runs, put k − 1 of the remaining buffers into the free pool, and initiate a read into the last buffer using our predictive strategy. Then we repeat the following until all k runs have been merged: do a kWayMerge, which merges just one output buffer load — as soon as the output buffer is full it stops, and if you're finished with all k runs it also stops; synchronize; if an input buffer was being read, put it into its queue; determine where to read from next, and start a read if there is more data to read; write the output buffer that was just filled; switch the roles of the output buffers; and repeat, so long as you haven't merged in the end-of-run marker. Questions about how we do the merge?

All right, so this is the strategy. Now, we don't actually know yet that this strategy works, because we don't know whether 2k buffers are sufficient for the scheme. In principle, if you had an infinite supply of buffers, it would be guaranteed to perform the task.
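Before looking at where the scheme can fail, here is a compact, synchronous simulation of the whole loop (my own sketch: disk reads and writes are modeled as list slicing, so the I/O overlap itself is not simulated — only the buffer management is: k queues of input buffers, a free-buffer pool, predictive reads, and two alternating output buffers):

```python
from collections import deque

END = float("inf")  # end-of-run marker: a key larger than every real key

def buffered_k_way_merge(runs, block_size):
    """Synchronous simulation of the lecture's buffered k-way merge.

    Each cycle fills one output buffer load, then "completes" the one
    pending read, attaches that buffer to its run's queue, predicts which
    run will exhaust first, and starts the next read from it.
    """
    k = len(runs)
    data = [run + [END] for run in runs]            # append end-of-run markers
    pos = [block_size] * k                           # next block offset per run
    queues = [deque([data[r][:block_size]]) for r in range(k)]  # one block each
    free = k - 1                                     # free input buffers (counted only)
    pending = None                                   # (run, block) read in progress
    # initialization: read into the one leftover buffer, from the run
    # predicted to exhaust first (smallest last key read so far)
    t = min(range(k), key=lambda i: (queues[i][-1][-1], i))
    if pos[t] < len(data[t]):
        pending = (t, data[t][pos[t]:pos[t] + block_size]); pos[t] += block_size
    idx = [0] * k                                    # offset into each front buffer
    out, result, done = [], [], False
    while not done:
        while len(out) < block_size:                 # kWayMerge: fill one output block
            # the lecture's first proof argues queues[i] is never empty here
            r = min(range(k), key=lambda i: (queues[i][0][idx[i]], i))
            if queues[r][0][idx[r]] == END:
                done = True; break                   # all k runs merged
            out.append(queues[r][0][idx[r]]); idx[r] += 1
            if idx[r] == len(queues[r][0]):          # front buffer emptied:
                queues[r].popleft(); idx[r] = 0; free += 1  # back to free pool
        if pending is not None:                      # synchronize: read completed,
            queues[pending[0]].append(pending[1]); pending = None  # attach buffer
        t = min(range(k), key=lambda i: (queues[i][-1][-1], i))    # predict
        if pos[t] < len(data[t]):
            assert free > 0, "the second proof says a free buffer always exists"
            free -= 1
            pending = (t, data[t][pos[t]:pos[t] + block_size]); pos[t] += block_size
        result.extend(out); out = []                 # "write" the block; swap buffers
    return result
```

Running it on a few run sets produces the fully merged sequence, and the two assertions that model the potential failure points never fire, consistent with the proofs that follow.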
To see why, or where, things can go wrong, look again at the k-way merge method and the steps in it. There is a step that says: if the output buffer is not full and an input buffer becomes empty, advance to the next buffer load in that queue. Well, maybe there is no next buffer load in the queue that just became empty. That is a potential point of failure: a buffer has become empty, you move to the next buffer in the queue, and that buffer is not there. In that case our algorithm would fail. The other steps are not points of failure: you're merging from the front buffers, and you stop when the output buffer becomes full or when you've put out the end-of-run marker, and you can't fail for those reasons. So that's one point of failure, and we have to establish that whenever you come to this point in the algorithm, there is a next buffer in the queue.

Another place where things can go wrong is where we say: initiate a read of the next block. To initiate that read you must have a free buffer; if all of your buffers are sitting in queues, there's no way for you to initiate it. There has to be a free buffer in the pool for you to do this. Do you see any other potential points of failure in the algorithm we've described? You should be able to convince yourself that these are the only two: one, you don't have the next block when you need it; and two, you don't have a free buffer to read into when you need to read. So we need to prove that neither of these two conditions can arise.

Let's look at the first one: you're doing the k-way merge, a buffer in one of the queues becomes empty, and the next buffer you're advancing to is not present. What we'll show is that if this failure were to occur, then two different valid analyses arrive at different numbers for the amount of data in memory — a contradiction. By "the amount of data available to the k-way merge" we mean the data in the input buffer queues plus the data in the active output buffer; we do not include the data in the buffer currently being written out or in the buffer currently being read into, since those are not available to the k-way merge. That's what we're going to count, and we'll show that one method of counting gives one number while the other, under the assumption of failure, gives a different number, so the assumption of failure must be invalid.

So here is where we are in the code: that's the k-way merge. The first time through, we read in k input buffer loads, so the data available to the k-way merge is just the k buffer loads sitting in the queues; the first time through, there are exactly k buffer loads available to the k-way merge.
Then the next time through, we take a buffer load that is full and write it out, but we also take an input buffer whose read was started in the previous round and attach it to a queue. One buffer load goes out, one buffer load comes in, so the amount of data remains k. In more detail: we start with k; an output buffer becomes full; we come down to the synchronization and wait for the I/O to complete; we take the input buffer and put it into its queue, so the data momentarily becomes k + 1; we determine which run will exhaust first, start a read, and start a write. The output buffer that became full is now the write buffer, you switch roles, so the active output buffer is empty, and we're back to exactly k buffer loads: one was added by the completed read and one was removed by the write. So on subsequent iterations, input and output take place at the same rate — you write one block, you read one block — and the amount of data stays at k, except towards the end, when there is no more data to read and you're doing more writing than reading.

But we couldn't have reached that end state at the time of failure: if you're trying to advance to the next block, you could not yet have output the end-of-run marker, because the moment the end-of-run marker is output, the merge stops. So you haven't yet output the end-of-run marker, which means you must have been reading and writing at the same rate up to this point; otherwise, by the predictive strategy, you wouldn't be advancing. Input and output proceed at the same rate up to the time of failure, and therefore you always had exactly k block loads available to the k-way merge. That's the first line of reasoning: at the time of failure, there are exactly k block loads available to the k-way merge.

The second line of reasoning: at the time of failure, the output buffer can't be full and the end-of-run marker hasn't been merged, so you have less than one buffer load in the active output buffer. One queue has become empty. The remaining k − 1 queues cannot have a second buffer attached to them, because of the predictive strategy: before any run would get a second buffer, the run that exhausts first would have received its replenishment. So the other runs each have only one buffer, and at worst each of those buffers is full; that's at most k − 1 buffer loads in the input queues. Less than one buffer load in the active output buffer plus at most k − 1 buffer loads in the input queues adds up to less than k buffer loads. The previous reasoning said exactly k buffer loads; this reasoning says less than k buffer loads; so I have a contradiction, and the assumption of failure must be incorrect. We cannot have a failure of this type when we use the predictive strategy. Any questions about that proof? As I said, the proof makes use of the predictive strategy, and that's crucial to not running out of data.
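The two counts can be written compactly (my own restatement of the argument, with D denoting buffer loads of data available to the k-way merge at the assumed moment of failure):

```latex
% Two counts of the data available to the k-way merge at the assumed
% moment of failure (advancing to a buffer that is not there):
\begin{aligned}
\text{reads and writes at the same rate:}\quad
  & D_{\mathrm{avail}} = k \ \text{buffer loads};\\[2pt]
\text{one queue empty, no backup buffers:}\quad
  & D_{\mathrm{avail}} < \underbrace{1}_{\text{active output}}
      + \underbrace{(k-1)}_{\text{other queues}} = k .
\end{aligned}
```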
A question came up about this count. This one here — right, so by definition, the total data available to the merge is the data in the active output buffer plus the data in the input queues; yes, the output buffer is being taken into consideration. The output buffer is not full, so there's less than a buffer load there. You have k queues; one has become empty — that's the one that caused the failure — and the other k − 1 queues can have only one block each, because you couldn't have brought in a backup block for any of them, given the predictive strategy. So the other k − 1 queues have at most k − 1 block loads, and adding the two together you get less than k.

All right, let's look at the second point of failure, over here, where we initiate a read for the next block; for this one we'll make use of the fact that we have 2k buffers. We're postulating that we don't have a free buffer to read into, which means that all 2k of our buffers are sitting in queues at this time. So suppose there is no free input buffer. Again we'll perform two different analyses based on this assumption. One will show that there are exactly k + 1 buffer loads in memory — not data available to the k-way merge this time, but the total amount of data in memory at the time of failure — and the other will show that there are more than k + 1 buffer loads, again arriving at a contradiction, so the assumption that there's no free input buffer must be incorrect. (How does that show it? Well, we haven't done the two analyses yet — let's do them.)

Here is where we are in the program. At this point, no read or write is taking place; everything has finished; we're done with the synchronization that took place back here, and the buffer that was being read into has been put into a queue. Also, up to this time, input and output must have been proceeding at the same rate: if they hadn't, that would mean you had stopped doing input at some point, and you stop doing input only when you've read in all the data — in which case you wouldn't be trying to read data now. So up to this point, input and output have been going at the same rate. The first time you get here, k blocks of data had been read in during initialization and we read one more block out here, so there are a total of k + 1 blocks of data. Then you initiate a write and initiate a read; one block gets written, one gets read; you come around again, you synchronize, and you still have k + 1. So up to the point of failure, there are exactly k + 1 blocks of data in memory when you reach this line. Any questions about that?

All right, the second analysis. At the time we come to this line, the active output buffer is full — there is still more data, we haven't merged the end-of-run marker, so we can't have a partial output buffer. Now focus on the input queues. It's possible that one of the queues has an empty buffer at its front that we haven't yet advanced past, because our k-way merge strategy says that as soon as the output buffer becomes full you stop, and the output buffer may become full at exactly the moment an input buffer becomes empty — in that case we don't free that input buffer. So it's possible that one queue has an empty input buffer at this time, but it's not possible that two queues do. So: one queue possibly has an empty first buffer, and the remaining k − 1 input queues have non-empty first buffers. In addition to these k first buffers, there are k other buffers sitting in the queues — second, third, fourth buffers — and those have to be full buffers.

So if you add it up: we've got k full buffers there and one full output buffer, that's k + 1, and since k is at least two, you've also got at least one non-empty first buffer, so you have more than k + 1. To be clear, I'm not saying that the k − 1 input queues are empty or non-empty in those two statements; I'm making a statement about the first buffer in each of the k queues. Each queue has at least one buffer in it, and potentially more. Looking only at the first buffer in each of the k queues: one of these first buffers may be empty, and the other k − 1 first buffers cannot be empty. In addition to these k buffers, since I have a total of 2k input buffers in my system, there are another k buffers; none of them is in the pool of free buffers, so these k must be in the queues, and if they're in the queues they have to be backup buffers, which must be full. That's where this line comes from: the remaining k input buffers are backup buffers and have to be full, so you've got k full buffers in the queues. That's why these are the k non-first buffers — we just don't know which queues they're in.
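Again the two counts can be summarized compactly (my own restatement; ε > 0 stands for the data in the at-least-one non-empty first buffer, which exists because k ≥ 2):

```latex
% Two counts of the TOTAL data in memory at the assumed moment of
% failure (no free input buffer, so all 2k buffers sit in queues):
\begin{aligned}
\text{reads and writes at the same rate:}\quad
  & D_{\mathrm{mem}} = k + 1 \ \text{buffer loads};\\[2pt]
\text{all } 2k \text{ buffers in queues:}\quad
  & D_{\mathrm{mem}} \ge \underbrace{1}_{\text{full output}}
      + \underbrace{k}_{\text{full backup buffers}}
      + \varepsilon \;>\; k + 1 .
\end{aligned}
```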
So I've got these k non-first buffers that are full; I've got the output buffer, which is full — that's k + 1; the one empty first buffer doesn't do me any good; and I've got at least one first buffer that is not empty. So I've got at least one record more than k + 1 buffer loads — more than k + 1 — and that contradicts what I had on the previous slide, which said there are exactly k + 1 buffer loads. So if you have 2k buffers, you cannot have this kind of failure. In summary: if you use the predictive scheme, you can't get the first kind of failure, and if you have 2k buffers, you can't get the second kind of failure — because here, where we want to initiate a read of the next block, the only way this can fail is if the pool is empty, and that is exactly the assumption, no free input buffer, that we've just contradicted. Other questions? All right, so the predictive scheme together with 2k buffers is sufficient to make our merge strategy work.

Now, to minimize the wait time, what you would like, of course, is for the time to fill an output buffer to be roughly equal to the time to read a buffer load, and roughly equal to the time to write a buffer load. In practice you don't have much control over this: your block size is determined by the disk format, and the time to fill a buffer is determined by your CPU speed. Even changing k doesn't really change the fill time, because if you use a loser tree, we saw that the time to merge is insensitive to the order of the merge. So this is what you'd like to have, but you don't have a whole lot of control over the parameters to make those times equal.

Typically you're going to be performing many k-way merges — at each internal node of the Huffman tree you do a k-way merge. The way we've written the merge, we first have to read the first block of each of the k runs being merged, and during that time you're only doing input; you're not overlapping input with any output or internal merging. You can arrange things so that for subsequent k-way merges you read the first blocks of the runs early, so you don't have this startup period. For that, you change the if statement in our merge loop: if there is more input from the current run, then you certainly go and get its next block; but if there isn't, that means you've read in all of the records — all of the blocks — for the current k-way merge, because with the predictive scheme, if there's nothing left from the run that's going to exhaust first, there's nothing left from any of the other runs. In that case you initiate a read of a block for the next k-way merge. That way, only the very first k-way merge you do has the startup cost of reading k blocks while you aren't doing any writing or merging.

All right, any questions? All right, we'll stop here then.
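The modified read decision could be sketched as follows (my own illustration; the `next_merge_first_blocks` queue is a hypothetical stand-in for the first blocks of the runs at the parent Huffman-tree node):

```python
def choose_next_read(exhaust_run, blocks_left, next_merge_first_blocks):
    """Decide what the next non-blocking read should fetch.

    exhaust_run: index of the run predicted to exhaust first in the
        current merge.
    blocks_left[r]: unread blocks remaining on disk for run r of the
        current merge.
    next_merge_first_blocks: list of not-yet-read first blocks of the runs
        of the NEXT k-way merge (hypothetical bookkeeping).

    Mirrors the modified if statement: read for the current merge while it
    has input; once it is fully read (the predictive scheme guarantees no
    other run has input left either), prefetch first blocks for the next
    merge, so only the very first merge pays the k-block startup cost.
    """
    if blocks_left[exhaust_run] > 0:
        return ("current", exhaust_run)
    if next_merge_first_blocks:
        return ("next", next_merge_first_blocks.pop(0))
    return ("none", None)
```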