Transcript

Okay, all right. Last time we started to talk about external sorting, and we're going to continue with that discussion of external sorting, actually for several lectures. Before I continue, do you have any questions on anything we've done so far? Yeah?

I guess the question is: last time, when I spoke about caches, there were multiple levels of caches, and I didn't take into account the time needed to move data from one cache to the registers to perform the arithmetic. Well, the numbers I had up there you can interpret either way. Say I said it's 100 cycles to get data from memory to L2. You could interpret that as: in 100 cycles you can get the data from memory to L2, and also into L1 and the registers. Alternatively, you can add the 10 cycles to go from, let's say, L2 to L1, and then whatever comes after that; it's not going to make a whole lot of difference. So in my analysis I just assumed that the hundred would cover everything: you could get the data from memory all the way down to the registers.

What about the search time required within a cache; is the quoted access time independent of the search mechanism used there? Well, again, we did an example to explain how things work. We didn't go into the circuitry, or whether you're using an LRU method to invalidate cache lines, and so on; we didn't go into that level of detail. The purpose of that discussion was to show that there are caches, that it costs time to get data, and that what it costs depends on whether you're getting it from the L1 cache, from L2, or from memory, and that the techniques used for external sorting can therefore be adapted to internal computations. We certainly did not go into a level of detail that would explain the different cache replacement strategies, the different cache access mechanisms, and so on. So it's kind of futile to ask whether those numbers include the details of this search or that particular mechanism. All right, other questions?

Now, the lectures are being recorded, as you've noticed, and they are available through streaming video from the Internet, in case you miss one of the lectures or you want to go back and review a past lecture. I've updated the link on our course website to get you to the place where the lectures are available; to get to the recorded lectures you will need to log in using your GatorLink accounts.

All right, so as I said, we're going to continue with our discussion of external sorting. And as I said last time, a good strategy for developing an external sort is to adapt what you already know from internal sorting: you take the best internal sort methods you're aware of and adapt them to the external sort environment. With that as a strategy, we took a look at the best sort method we knew for average performance, quicksort, and we saw how to adapt it. That adaptation revealed the need for a data structure to represent a double-ended priority queue. It also introduced the concept of buffers: portions of memory into which you can read a block of data from your disk, or portions of memory in which you can assemble a block of data that you'll eventually write to disk.
It also revealed the idea of trying to do things like input/output and internal processing in parallel. In the simple discussion we had, we had to stop working when an input buffer became empty, or when one of the output buffers became full. But toward the end I said that if you had double buffers, two input buffers, two output buffers for the left group and two output buffers for the right group, then while you were filling one input buffer you could be consuming from the other input buffer, and while you were writing one output buffer you could be filling the other one from internal processing. So that discussion introduced several new concepts.

Today we're going to take a look at merge sort, which is the best sort method from the standpoint of worst-case performance. We will see how we can adapt it to the external environment, and see what additional data structures may be needed to support an external sort.

All right, so let's first take a quick review of internal merge sort. Internal merge sort essentially works in two phases, or two steps. In the first phase you create sorted segments. For example, if you're using something called a natural merge sort, then the naturally occurring sorted segments in your collection of records to be sorted are used: you simply sweep your input file from left to right, and wherever there is a decrease in key, that marks the end of a segment. Another way to create the initial sorted segments is to disregard the naturally occurring sorted segments and just take blocks of some fixed size; maybe you take 20 records at a time, sort them using an internal sort, and then you get segments of size 20 that are sorted. So you start by creating sorted segments, and those are two common schemes used to create the initial sorted segments. Then, in the second phase, you merge together pairs of sorted segments. When you merge two segments together, you replace those two with one; if you repeat this process, you'll be left with a single sorted segment. This second phase works in something called a pass: if you started with ten sorted segments, in one pass you reduce the ten segments to five, in the next pass you reduce the five to three, then the three to two, and the two to one. All right, so that's basically how an internal merge sort works.
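Since this two-phase structure is the template for everything that follows, here is a small Python sketch of it; the function names and the fixed segment size of 20 are just illustrative choices, not anything from the lecture.

    # Minimal sketch of internal merge sort: phase 1 builds fixed-size sorted
    # segments, phase 2 merges pairs of segments in passes.
    def make_segments(records, size=20):
        # Phase 1: sort fixed-size chunks, e.g. 20 records at a time.
        return [sorted(records[i:i + size]) for i in range(0, len(records), size)]

    def merge_two(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def merge_sort_in_passes(records):
        segments = make_segments(records)
        while len(segments) > 1:
            # One pass: pairwise merge; an odd segment is carried over as-is,
            # which is how ten segments become five, then three, two, one.
            nxt = [merge_two(segments[k], segments[k + 1])
                   for k in range(0, len(segments) - 1, 2)]
            if len(segments) % 2:
                nxt.append(segments[-1])
            segments = nxt
        return segments[0] if segments else []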
Now we're going to try to do an external merge sort in essentially the same manner: we have two phases. In the first phase we will create sorted segments; in the context of external sorting we'll refer to a sorted segment as a run. Then, in the second phase, we'll merge together pairs of runs, in passes.

All right, so let's illustrate how this works by looking at an example. In our example we want to sort 10,000 records; however, the computer we have has sufficient memory to hold only 500 records. We'll assume that the block size is 100 records. Then, for the analysis, we'll make use of some variables to denote how much time different tasks take. T_IO will represent the amount of time it takes to either read in or write out one block of data; that includes the seek time and the latency time as well as the transmission time. The analysis could, for example, assume the worst-case scenario, where you pay the maximum seek (you're moving from one end of the disk all the way to the other) and the maximum latency (you have to wait for a full rotation to take place). We will make use of T_IS, the time to internally sort a memory load of data; in this case that's the time to sort 500 records. Again, T_IO is the time to input 100 records, one block. Then we'll use T_IM as the time to internally merge one block load of data, that is, the time to create 100 records in the output file.

All right, so there are two phases: the run generation phase (a run is just what we were calling a sorted segment before) and the run merging phase, which is what was phase two of the internal sort, where we merge together pairs of runs until we're left with a single run.

So let's look at the first phase, run generation. Here's my memory; it's got enough capacity to hold 500 records. When I say it's got enough capacity to hold 500 records, what I really mean is that it can hold the 500 records plus the program plus other variables needed to run the program; but basically I have the memory to sort 500 records using some internal sorting scheme. So 500 records represents five blocks of data. I've got 10,000 records sitting on disk, and that represents 100 blocks of data. The strategy is going to be: you read in five blocks, which fills up your memory; you sort the five blocks, using, say, an internal merge sort; and then you write out those five blocks as a sorted segment, or a run. So this strategy is going to create runs that are five blocks long. This creates one run, and then basically we're going to repeat this 20 times, and the result is going to be 20 runs, each run being 5 blocks long. Any questions about the strategy to create runs? You see that we have enough memory to do this.

All right, so let's look at the time it takes. You want to input five blocks, so that's 5 T_IO; this is the cost per iteration of this loop. So 5 T_IO to input those blocks; you want to sort, and that's one T_IS, the time to sort a memory load; then you want to output those five blocks, so that's another 5 T_IO. So creating one run takes you 10 T_IO plus T_IS, and then we're going to do this 20 times, so it becomes 200 T_IO plus 20 T_IS. So, using this strategy, we can create runs, each run five blocks long, and the amount of time we're going to spend is 200 T_IO plus 20 T_IS. Are we in agreement on this? Okay.
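As a sketch of that run-generation loop (with the disk simulated as a plain Python list of 100-record blocks, since we're not modeling a real device):

    # Run generation (phase one) for the example: 100 blocks of 100 records,
    # with memory large enough for 5 blocks at a time.
    BLOCK = 100          # records per block
    BLOCKS_PER_LOAD = 5  # memory holds 500 records

    def generate_runs(disk_blocks):
        """disk_blocks: list of 100-record lists (100 blocks = 10,000 records)."""
        runs = []
        for i in range(0, len(disk_blocks), BLOCKS_PER_LOAD):
            # read a memory load: 5 T_IO
            load = [r for blk in disk_blocks[i:i + BLOCKS_PER_LOAD] for r in blk]
            load.sort()                       # internal sort: T_IS
            # write the sorted memory load back out as one 5-block run: 5 T_IO
            runs.append([load[j:j + BLOCK] for j in range(0, len(load), BLOCK)])
        return runs  # 20 runs of 5 blocks each; total cost 200 T_IO + 20 T_IS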
Well, let's go to run merging, the second phase. Again we'll use the concept of a merge pass: in a merge pass we read in all of the runs and reduce the number of runs by a factor of two. So the strategy is to pairwise merge these 20 runs and bring them down to 10. We'll see in a moment how we actually accomplish this, but at a high level the strategy is to do this pairwise merge of the 20 runs into 10. We use the term merge pass to represent a process in which all the current runs are read in and merged, and the number of runs you are left with is one half of what you started with. Then we're going to repeat this process four more times.

I have a picture here to show what's going to happen. I start with 20 runs; in the first merge pass that 20 goes to 10 by merging together pairs of runs, so the R's become the S's. Each R is 5 blocks long, so each S will be 10 blocks long. In the second pass I merge the S's in pairs to get the T's, and the number of T's will be 5; each S is 10 blocks long, so each T is going to be 20 blocks long. In the third pass, pairs of T's are merged to get the U's, and since you have an odd number of T's, the last one, T5, will not have a partner to merge with, so we just copy it over as a U. So now the number of runs reduces from five to three; the T's were 20 blocks long, so the U's will be 40 blocks long, except the last one, which is only 20 blocks long. Then I have another round of merging, another merge pass: the U's become V's, and from 3 we go to 2; the V's are now 80 blocks long, except for the last one, which is only 20. And then there's a final merge pass, in which the V's get replaced by a single W. So in the second phase I merge together pairs of runs; the merging is done in passes; and each pass reduces the number of runs by a factor of 2, roughly, while at the same time it doubles the size of a run. All right, so that's the overall strategy, and it's the same as what's done in the second phase of an internal merge sort.

How do we merge a pair of runs, given that each run is of a size larger than the capacity of memory? That's something we're going to look at in a moment; at this point I just wanted to make clear how we plan to do this at a high level. To implement it, as you've correctly noted, we need to figure out how we implement one of these pairwise merges, at each node of this merge tree; how do we actually do that merge? That's what we want to see right now. Any other questions about the overall strategy?
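The pass-by-pass bookkeeping just described (20 R's to 10 S's to 5 T's to 3 U's to 2 V's to 1 W, with run lengths doubling) can be checked in a few lines of Python:

    # Check of the merge-pass bookkeeping: run counts roughly halve each pass
    # and run lengths roughly double; an odd run is carried over unmerged.
    runs = [5] * 20                      # 20 R's, each 5 blocks long
    p = 0
    while len(runs) > 1:
        nxt = [runs[k] + runs[k + 1] for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            nxt.append(runs[-1])         # e.g. T5 is copied over as a U
        runs, p = nxt, p + 1
        print(f"pass {p}: {len(runs)} runs, block lengths {runs}")
    # pass 3 prints 3 runs of lengths [40, 40, 20]; pass 5 prints the single W.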
All right, so let's take a look at how we might merge together a pair of runs. Here's our memory configuration. I'm going to partition memory into buffers, like I did for quicksort. I'll designate a portion of memory as an input buffer, input 0; this will be one block long, so 100 records of memory, and it's going to be used to input run 1, one block at a time. Then I'll have another portion of memory designated as an input buffer, call it input 1, and that's going to be used to read in run 2, one block at a time. I still have 300 records' worth of space; I take another 100 records' worth of it and designate that as an output buffer, in which I will assemble a block of S1. So input 0 will be used to bring in R1 one block at a time, input 1 will be used to bring in R2 one block at a time, and the internal merge will assemble S1 here, one block at a time. If an input buffer becomes empty, we replenish it with the next block from its run: from R1 for input 0, from R2 for input 1. If the output buffer becomes full, I output that block of S1 to disk, and once I'm done with the output I can continue to assemble the next block of S1.

So, as I said, input 0 services R1, input 1 services R2, and the output buffer collects S1. To start the process, I read in the first block of R1 and the first block of R2. To do the merge, I look at the first records in these two blocks, and the smaller one moves to the output buffer. Then you look at the first unmoved records from here and there, and the smaller one moves over. You keep doing this until one of the input buffers becomes empty or the output buffer becomes full. When the output buffer becomes full, you have to stop internal merging and write it out; when an input buffer becomes empty, you stop internal merging and get the next block from the disk. So, as long as you have enough memory for three blocks of data, you can merge together two runs, independent of how long the runs are.

Any questions about how we merge together a pair of runs, independent of their size, provided you've got three blocks of memory? Yeah? Let's see: what if the output buffer becomes full and you've got half a block left here and half a block there? Well, you leave those; at that time we're just going to write this one out, and once you finish writing it, we'll continue with this half and that half. There's no requirement that all three events happen at the same time, that the output buffer gets full and the two input buffers become empty simultaneously. All right, any other questions? No? All right.

Let's look at how much time this takes. Each of R1 and R2 is five blocks long. To do the merge I just described, all five blocks of R1 and all five blocks of R2 are read in, at different times, so the total input time is 10 T_IO. All ten blocks are written out, at different times, one block at a time, so there are ten blocks of writing; that's another 10 T_IO. And all ten blocks are internally merged from input buffers to the output buffer, so that's 10 T_IM. So it takes 20 T_IO and 10 T_IM to merge R1 and R2 into S1; that includes the input, the output, and the internal processing.

Now for the time for the first pass: in the first pass I'm going to do this ten times, first R1 and R2, then R3 and R4, and finally R19 and R20. I do it ten times, so I'll spend ten times that much time. All right.
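Here is a compact sketch of that three-buffer merge, with the disk again simulated as lists of blocks; refills and flushes happen exactly at the empty-buffer and full-buffer events just described.

    # Merge two runs r1, r2 (each a list of 100-record blocks) using only two
    # input buffers and one output buffer, as described above.
    BLOCK = 100

    def merge_pair(r1, r2):
        out_run, out_buf = [], []
        in0, in1 = list(r1[0]), list(r2[0])   # read the first block of each run
        n0, n1 = 1, 1                          # index of the next block to read
        while in0 or in1:
            # move the smaller front record; an exhausted run always loses
            if in0 and (not in1 or in0[0] <= in1[0]):
                out_buf.append(in0.pop(0))
            else:
                out_buf.append(in1.pop(0))
            if len(out_buf) == BLOCK:          # output buffer full: write it out
                out_run.append(out_buf); out_buf = []
            if not in0 and n0 < len(r1):       # input buffer empty: replenish
                in0 = list(r1[n0]); n0 += 1
            if not in1 and n1 < len(r2):
                in1 = list(r2[n1]); n1 += 1
        if out_buf:
            out_run.append(out_buf)
        return out_run   # for 5-block runs: 20 T_IO of I/O plus 10 T_IM of merging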
Then we're going to do the next pass, which merges the S's into T's, with the same strategy as before, except that now each S is 10 blocks long. So to read in S1 is 10 T_IO, and to read in S2 is 10 T_IO; in the merge, all 20 blocks get read in, at different times, so the input time is 20 T_IO. The output time is the same as the input time: everything you read in, you've got to write out. And everything you read in also has to be merged from the input buffers to the output buffer, so that's 20 T_IM. So the total time to do one pair, when you're going from S's to T's, is 40 T_IO's and 20 T_IM's. To go from the S's to the T's, there are 10 S's, so we'll have five pairwise merges; it becomes five times that, which is 200 T_IO plus 100 T_IM, the same as what we had when we went from R's to S's.

In general, in a merge pass you read in all of the data, and that's 100 T_IO; you write out all of the data, another 100 T_IO; and you move all the data from input buffers to output buffers, which is 100 T_IM. So, independent of which pass you're in, you spend roughly the same amount of time, 200 T_IO plus 100 T_IM. I say roughly because if you have an odd number of runs, the last run may be handled a little differently, but approximately, each pass takes the same amount of time.

So the total time you spend merging runs together is the time for a merge pass (the time to read in all the records, write out all the records, and move all the records from input to output buffers) times the number of passes that you perform. And the number of passes is the log, base 2, of the number of runs you started with. In our case we started with 20 runs, so it becomes log of 20 base 2; you take the ceiling of that, which works out to 5. So you have 5 passes, each taking the single-pass time given above.

All right, so the strategy used for internal merge sort can be adapted to an external merge sort; the only requirement is that you have enough internal memory for three blocks. Any questions about the adaptation, and about whether you can in fact use this strategy to sort arbitrarily large collections of records?

All right, so let's now take a look at the factors that contribute to our external merge sort time, and then we can focus on those factors to see if there's a way to reduce the components that make up the total time. Run generation took 200 T_IO plus 20 T_IS. The components there: there's T_IS, the internal sort time, the time to sort one memory load; and there is the input and output time, the 200 T_IO we're spending. As far as run merging goes, we have the time for a pass times the log, base 2, of the number of runs. So the internal merge time is a contributor: T_IM, the time it takes to move a block from the input buffers to the output buffer. The total time you spend inputting and outputting is 200 T_IO times log of 20 base 2. The number of initial runs, in this case 20, is a factor: if you started with a smaller number of runs, you'd have to make fewer passes. And the other factor here is that base 2 in the logarithm. We have a base 2 in the logarithm because we were merging pairwise; that's the order of the merge, a two-way merge. Conceivably we could try to merge, say, four runs together into one instead of two into one, and that would reduce the number of passes.
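Putting the two phases together for this example (the ceiling of log base 2 of 20 is the 5 passes just mentioned) gives the totals below, kept in symbolic units:

    # Total-time bookkeeping for the 10,000-record example, in units of
    # T_IO (one block in or out), T_IS (sort a memory load), T_IM (merge a block).
    from math import ceil, log2

    runs = 20
    passes = ceil(log2(runs))                    # 5 passes
    gen_io, gen_is = 200, 20                     # run generation
    pass_io, pass_im = 200, 100                  # one merge pass
    total_io = gen_io + passes * pass_io
    total_im = passes * pass_im
    print(f"{passes} passes: {total_io} T_IO + {gen_is} T_IS + {total_im} T_IM")
    # -> 5 passes: 1200 T_IO + 20 T_IS + 500 T_IM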
All right, so we've seen the factors; let's go back and look at them again and see what we might be able to do to improve, that is, to reduce, the total time spent both in run generation and in merging all the runs together. The objective, of course, is to reduce the overall time. We don't really care how much time is spent in the first phase and how much in the second phase individually; the sum of the two is what we want to reduce.

So, to improve run generation: one component there was the total time spent on input and output. If we look at the formula, we had the input time, to that we added the output time, and to that we added the time spent internally sorting. So we could think of trying to do these three activities concurrently rather than serially. It shouldn't be that while you're inputting you can't be outputting, or that while you're outputting you can't be generating part of the output for the next round of writing. So we might try to see if there's a way to do these three activities in parallel, to overlap the time spent in the three. In that case, instead of adding the time spent on input to the time spent on output to the time spent internally sorting, maybe we'd be taking the max of those three quantities. Certainly, in order to overlap, your hardware has to support doing these activities in parallel; so we'll look at the possibility of overlapping them.

Now, another component we put down there was the time to sort a memory load. It doesn't look like you're going to be able to do much there, because that's internal sorting, and we said we're already starting with the best internal sort scheme. Another component we can work on, one that didn't show up in the time needed for run generation but affects the second phase, is the number of runs. When we looked at the factors for the second phase, the number of runs you started with was a factor, and that's determined by run generation. In the strategy we've used so far, the number of runs was simply the total number of records divided by the capacity of your memory, but we may think of alternative strategies for generating runs where perhaps you end up with a smaller number of runs.

All right, so we want to look at two things. One is overlapping. As far as overlapping is concerned, I said it depends on what your hardware supports. A minimal hardware configuration you would need is two disk drives: you can't be reading and writing from the same disk at the same time, because the head cannot be at two different places. So you're going to need two different disk drives; you can be reading from one and writing to the other. Also, your hardware needs to support what we call a non-blocking read and a non-blocking write. By non-blocking I mean that while you are writing a buffer load, that write should not block all of the other operations of the computer. I should be able to initiate a command that says "write this block," and as soon as the command is initiated, go on to the next instruction, which could be to read a block from disk; and when that read is initiated, continue with internal arithmetic and logical operations. So we're going to assume that you have two disks, and also that you have non-blocking reads and writes.
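As a back-of-the-envelope illustration of what overlapping buys (the per-load costs here are made-up numbers, not the lecture's): with serial operation, each memory load costs input plus sort plus output, while with two disks, non-blocking I/O, and enough buffers, the steady-state cost per load is only the maximum of the three.

    # Serial vs. fully overlapped run generation, with hypothetical costs
    # per memory load (arbitrary time units, purely for illustration).
    t_in, t_sort, t_out = 5.0, 3.0, 5.0
    loads = 20

    serial = loads * (t_in + t_sort + t_out)       # stop-and-go, as analyzed above
    overlapped = loads * max(t_in, t_sort, t_out)  # slowest activity dominates
    print(serial, overlapped)                      # 260.0 vs. 100.0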
All right. Then, second, we want to be able to generate a smaller number of runs than the simple strategy we just discussed produces, and that's the same as saying we want to generate runs longer than the capacity of the memory. We'll see, perhaps not in the next lecture but in a subsequent one, that by changing our strategy we can in fact generate runs whose length, on average, is two times the size of memory. So we can reduce the number of runs by a factor of two, on average.
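The payoff of that factor of two, taking the lecture's claim at face value, is one fewer merge pass in this example:

    # Effect of doubling the average run length: fewer initial runs,
    # and hence fewer merge passes over all 10,000 records.
    from math import ceil, log2
    n, memory = 10_000, 500
    for run_len in (memory, 2 * memory):
        r = ceil(n / run_len)
        print(run_len, r, ceil(log2(r)))  # 500  -> 20 runs, 5 passes
                                          # 1000 -> 10 runs, 4 passes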
But let's go to the second phase. Yes? A question about merging, say, R1 and R10? Right, the comment is that if, instead of merging R1 and R2 together, we started by merging R1 and R10, that might have an impact on the time it takes to compare a pair of keys in the merge, and perhaps you could reduce the running time in that fashion. There are two things to keep in mind with respect to that. One is that there is no reason to believe that the keys in R1 are a whole lot different from the keys in R10 versus those in R2, because there's no relationship among R1, R2, R3, and R10: you started off with a collection of records, and there's no reason to believe that records with similar keys are closer to each other than records with different keys. We start with an unsorted sequence. So whether you do R1 and R2, or R1 and R3, or R1 and R10, statistically you don't expect any difference. The second thing has to do with the actual numbers. Suppose you had two numbers and you wanted to compare them. If the two numbers differed in the most significant bit, versus being the same in the first 15 bits and then differing in the 16th, is there a difference in how much time the compare takes? That would, of course, depend on what comparison architecture or hardware you've built and what the comparison algorithm is. Are you doing a bitwise compare in series? No computer does that; otherwise the time to do a compare would increase as you increase the word size. So even though there is variation in the time it takes to do a compare, the variation is not that large; it's not like going from 1 to a factor of 32 on a 32-bit machine.

Okay, so let's go to the second phase, which is the run merging phase. Here, too, we can think about trying to do the three things, input, output, and internal merging, in parallel, overlapping the time spent in them, rather than our simple scheme where, while you were inputting a buffer load, you could not be writing and you could not be internally merging. Certainly, to do this you're going to need a lot more buffers, and you're going to need a buffer management strategy. And, as was the case for the first phase, you're going to need the same kind of hardware: two drives, so that you can read and write at the same time, and non-blocking reads and writes; we make the assumption that the hardware supports this.

Then, the number of passes was important, and this is determined by the order of the merge. Our example did a two-way merge; we could think of doing a higher-order merge. In general, the number of passes is log base K of the number of runs if you do a K-way merge. So, for example, with our twenty runs, if we did a five-way merge, then going from R's to S's, the 20 R's would reduce to 4 S's: we merge the R's five at a time. Of course, to do this we will need five input buffers and an output buffer, in a minimal configuration using the simplistic scheme we discussed, so you've got to have enough memory for that, or you have to be able to reduce the block size. In this case we'd have just two passes, and the second pass would become a four-way merge, since you don't have five S's to merge.

Now, if you reduce the number of passes, the benefit you get isn't necessarily proportional to the reduction in the number of passes. So let's do a little bit of analysis here. As I said, if you're going to do a K-way merge, then you need K input buffers plus one output buffer, so the number of buffers goes up linearly in the order of the merge. But the size of your memory is fixed, so at some point, when you increase K beyond that point, you're going to have to reduce the block size. Let's assume your operating system lets you reduce the block size. Once you reduce the block size, the number of blocks that have to be read in during a pass goes up: if a block is smaller, the number of blocks goes up, so each pass reads more blocks, in which case there are more seeks and latencies you have to pay. So the input time per pass goes up. The number of passes may go down, but the input time per pass will go up. So you end up with a curve that looks something like this; this is the time per pass. Initially, as you go from K equals 2 to 4 to 8 to 16, you don't see any change in the time per pass, because the block size is sufficiently small that you can accommodate the increased number of buffers in your memory. But at some point you have to reduce the block size, in which case the number of blocks you read in per pass goes up, and so the input time, or output time, per pass goes up.
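To see the shape of that curve numerically, here is a toy calculation; the seek-plus-latency cost and the per-record transfer cost are invented for illustration and are not the lecture's numbers.

    # Per-pass input time vs. merge order K, for a fixed memory of 500 records.
    # A K-way merge needs K + 1 buffers, so beyond K + 1 = 5 the block shrinks.
    from math import ceil

    MEMORY, N = 500, 10_000
    ACCESS, PER_RECORD = 10.0, 0.1    # hypothetical seek+latency, transfer costs

    for k in (2, 4, 8, 16, 32, 64):
        block = min(100, MEMORY // (k + 1))     # records per block
        blocks = ceil(N / block)                # blocks read in one pass
        t_pass = blocks * (ACCESS + PER_RECORD * block)
        print(f"K={k:2d}  block={block:3d}  input time per pass={t_pass:7.0f}")
    # flat at first (K = 2, 4), then rising steeply as the block size shrinks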
So the question, then, is: we see that the I/O time per pass is going to go up after a point; what does this mean for the total time of the second phase? We need to look at the total time. The total I/O time is given by the I/O time for one merge pass times the number of passes, which is the log, base K, of the number of runs. Initially, in the previous graph, we had a flat portion: the per-pass time is flat, but the number of passes is going down, so you see a reduction in the total I/O time. After a point, even though log base K of the number of runs is still going down as you increase K, the per-pass time is going up, and the net effect is an increase in the I/O time. So you can get a reduction in total I/O time up to some critical value of K.

What happens to the internal time spent doing the merges as you increase K? A simple way to do the merge, when you go to a higher-order merge, is to just extend what you were doing for a two-way merge. In a two-way merge you would compare the first two keys and move the smaller one out. So now, say I'm doing a six-way merge: I've got six records, so I do five comparisons among the six front keys in these buffers, and the smallest one goes out. So, doing K minus 1 compares, you can find the smallest of K items and push it to the output buffer. Now consider your time to merge in a pass. If you start with n records (in our case n was 10,000), all n will end up in the output buffers. The number of comparisons needed to move a record from the input buffers to the output buffers is K minus 1, and all n records go that way, so it's (K - 1) times n compares, and the C factor there converts compares into, say, microseconds. So that's the amount of time it takes to do a merge pass internally. You multiply that by the number of merge passes, which is log of the number of runs base K; the number of runs is R, so that's log of R base K. Then you replace log of R base K with log of R base 2 divided by log of K base 2, and you end up with the expression C (K - 1) n (log of R base 2) / (log of K base 2). If you look at this, the important term is K divided by log of K base 2, which tells you that as you increase K, the internal merge time goes up; it's not going up linearly in K, but it's pretty close to that: it's K over log of K base 2.

Okay, so what we've seen is that, one, as you increase K beyond a certain point, you have to reduce the buffer size, so you want to conserve the number of buffers if you're trying to go to any sophisticated buffer management scheme; and two, as you increase K, the internal merge time, the total amount of time you spend merging, goes up, so that's not good either. Now, we can remedy this by changing the strategy used to do a K-way merge: we use a data structure called a tournament tree. In between your runs and the output buffer, you put a tree, which is used to select the next record that moves into the output buffer. We will see that when you use a tournament tree, the time required to move a record from an input buffer to the output buffer, instead of being of the order of K, becomes of the order of log K. So if you have to move n records from input buffers to the output buffer, as you do in a pass, then the time becomes D times n log K, where D is some constant factor that converts comparisons to microseconds.
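The tournament tree itself is next lecture's topic. As a stand-in with the same order-log-K cost per record moved, here is a K-way merge using Python's heapq (a binary heap, which is not the tournament tree the lecture will develop, but gives the same asymptotic bound):

    import heapq

    def k_way_merge(runs):
        """Merge K sorted runs with O(log K) work per record moved."""
        # seed the heap with the front record of each run
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            key, i, j = heapq.heappop(heap)   # smallest of the K front keys
            out.append(key)                   # move it to the output, O(log K)
            if j + 1 < len(runs[i]):
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
        return out

    # k_way_merge([[1, 5, 9], [2, 6], [3, 4, 8]]) -> [1, 2, 3, 4, 5, 6, 8, 9]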
So we get an improvement here: previously we had K minus 1 as our factor, and that drops to log of K. And what that does, when you plug it into the total time as we did before, is make the K vanish from the total time: the time per pass is multiplied by the number of passes, log of R base K; we replace log of R base K with log of R base 2 divided by log of K base 2; the log K's cancel out, and you get just a D n (log of R base 2) term. So if, instead of using the simple strategy to merge, you make use of a tournament tree, we can make the time for the internal merging portion insensitive to K. Now, you do have a penalty when you switch from the simple scheme to this scheme: if you tried this scheme with K equal to 2, there would be an overhead penalty, because the D here is bigger than the C we were using there. But once you make the switch, it doesn't matter whether you're dealing with a K of 8 or a K of 16 or a K of 32; it all takes the same amount of time.

All right, so to implement merge sort effectively, we need to have a good buffering strategy; we don't want to use more buffers than needed. The buffering strategy should support parallel input, output, and internal operations. And we want to use tournament trees to do the K-way merge. So next time, we'll start by looking at tournament trees and then see how they can be used both for run generation (they'll give us runs whose size, on average, is two times that of memory) and also for our K-way merge; and then we'll look at buffering strategies. Okay.