Transcript

00:12 Okay, all right. Do you have any questions you want to ask before I get started today?

00:25 I guess some of you had indicated a problem registering for the course last week. As of this morning there are openings, so if anybody hasn't registered and would like to, you should be able to go into the registration system and do that.

00:39 All right. Now we're going to start talking about data structures for various applications, and the first set of data structures I'm going to talk about is motivated by an application to external sorting. In today's lecture I'm going to briefly describe the external sorting problem, describe an adaptation of quicksort to the external sorting environment, and see how this adaptation motivates the need for a new data structure. In the next lecture we'll take a look at an adaptation of merge sort to the external sorting environment and see that that adaptation calls for yet other data structures, and then we'll have several lectures in which we will study data structures that were motivated by external sorting.

01:36 All right, so external sorting: what is it? We want to sort a collection of records. We've got n records, these records are sitting on a disk, and when we're done sorting, the sorted collection is also to be sitting on a disk. So the input is on a disk and the output is on a disk. The constraint is that the amount of space needed by these n records is very large, sufficiently large that it's not possible for you to read in all n records, sort them using an internal sorting scheme like quicksort or merge sort, and then write the sorted result back. So that's not a feasible way to sort our collection of records; we don't have enough memory to hold all of the records at one time.

02:27 Now, the space needed may be large for two somewhat different reasons. The first is that the space needed is larger than the size of your memory because you have too many records; individual records may be small or large, but the important characteristic in this case is that the number of records is very large. Another reason why you may not be able to read all the records into memory is that you actually have a small number of records, but each record is individually large. The way you would handle these two cases could be very different, but in either case it's not practical to read in the entire collection of records, sort it using an internal sorting scheme, and then output the result.

03:31 Let's look at the second case, where the number of records is actually small, which means individual records are large. One way to handle this case is to transform the problem from a sorting problem into a permutation problem, where you start by just reading in the keys of this small number of records. The number of keys is small, so they'll fit in memory; you read them in, sort the keys, determine the required permutation that has to be performed, and then you perform that permutation on the records. Once you know what permutation is needed, you can apply it record by record: if you've got only ten records and they're big, you read in the first record and write it to the right spot in the output configuration, read in the second record and write it to its right spot, and so on; do it ten times and you've got your sorted sequence. Or maybe you read in the front parts of your ten records, internally permute them, write out the front parts, then read in the next parts, and so on. So you can transform this sorting problem into a permutation problem, which is easier to handle when the number of records is small.
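To make the key-sorting step concrete, here is a minimal C++ sketch of the permutation approach; the readKey and copyRecord disk routines are hypothetical stand-ins for whatever block I/O interface you actually have.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical disk routines; stand-ins for a real block-I/O layer.
long readKey(int recordIndex);                // fetch just the key of record i
void copyRecord(int fromIndex, int toIndex);  // read record i from input, write it to output slot j

// Sort a small number of large on-disk records by first sorting their keys.
void permutationSort(int n) {
    std::vector<long> key(n);
    for (int i = 0; i < n; ++i) key[i] = readKey(i);  // the keys fit in memory

    // order[j] = index of the record that belongs in output position j
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return key[a] < key[b]; });

    // Each large record is read and written exactly once.
    for (int j = 0; j < n; ++j) copyRecord(order[j], j);
}
```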
04:58 All right, this is not the case that we want to focus on. The case we want to focus on is where the number of records is actually very large, so it's not practical for you to read in all of the keys and determine the permutation.

05:17 Now, when we study this case of external sorting, a large number of records, we'll see the need for a data structure called a tournament tree, and we'll study those. We'll see the need for Huffman trees, and for data structures for double-ended priority queues. You've already seen a structure for a single-ended priority queue: a single-ended priority queue is a collection of elements, each element has a priority, and for this collection you can do things like insert, and remove the element with the minimum priority, say, or the element with the maximum priority; for that you use the classic heap data structure. A double-ended priority queue would require you to be able to delete sometimes the min element and sometimes the max element, so you can delete from both ends of the queue. We'll also study a technique called buffering. And the ideas that get introduced in external sorting can then be adapted back to internal sorting, to improve the performance of well-known internal sorting methods such as heap sort and merge sort. So even though we are focusing on external sorting, the ideas that get developed there can be turned around to improve the performance of internal sorting schemes.

06:53 Now, in order to study external sorting we need a model for the computer we are working with. Actually, even to study internal sorting you need a model; we'll take a look at the internal model later, but let's first look at the external sorting model. The model we're going to use is a fairly simple one. You have an arithmetic logic unit; this does all of the arithmetic, the comparisons, and so on. You have a main memory, so this unit operates off of the main memory; it can exchange data with memory and so on. And then you have a disk that's attached to the main memory. The input begins on the disk, and the output is to be left on the disk.

07:43 Now, the characteristics of the disk: on the physical medium itself, the data storage is organized in concentric circles called tracks, and you have a read/write head. If you want to read or write data, you have to know where on the disk you're going to write it or read it from; you have to move the read/write head to the appropriate track, then wait for the right part of the track to come under the head, and then you can start the read or the write.

08:18 So when you're trying to read or write, the amount of time you take is broken up into three components. There's a seek component: the time it takes to position the read/write head over the proper track. That time is huge compared to the time it takes your arithmetic unit to do operations; it may be on the order of five orders of magnitude more than the time to do an arithmetic operation. It's a very slow operation; you've got to physically move that head. Then you have latency time: the time it takes for the right part of the track, the proper part of the track, to come under the head. The disk is rotating at high speed, but you have to wait for the proper part of the track to come under the head before you can initiate the read or write. The latency is not as bad as the seek, but it's still pretty bad; you're looking at maybe four orders of magnitude more than an arithmetic operation. Then you have the transfer time, and that of course depends on how much data you're transferring.

09:42 Typically when you do the transfer, you transfer what's called a block of data. How big a block is may be something that's controlled by the operating system, so you may not have any control over it; when you go out to get a block, a block could be a thousand bytes of data, or it could be more. Or you may have the ability to choose a block size, in which case the disk has to be formatted to that block size. So depending on the environment you're in, you may or may not have control over the size of a block, but you read in blocks of data and you write out blocks of data.

10:30 All right, so three components: the seek time, the latency time, and the transfer time. Seek and latency are large, so when you get data from a disk, you try to minimize the number of times you go to the disk to get the data.
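As a rough back-of-the-envelope illustration (these numbers are invented for the example, not taken from the lecture): suppose a seek costs about 10 ms, rotational latency about 5 ms, and the transfer rate is 100 MB/s. Then reading one 4 KB block costs roughly

t ≈ t_seek + t_latency + (block size)/(transfer rate) = 10 ms + 5 ms + 4 KB / (100 MB/s) ≈ 15.04 ms,

so the cost is almost entirely seek and latency, which is exactly why you want to move data in as few, as large, chunks as possible.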
10:53 Now, typically when we develop algorithms for internal applications, which means all of your data is sitting in memory to begin with and the results are to be left in memory, the model we use is this one. You have an arithmetic logic unit, where all of the arithmetic and logical operations are done, and you have a main memory. If you want to do an add, you get the two numbers you want to add from memory, bring them in, add, and put the result back. In this model, the number of times you fetch data from memory or write back to memory is very closely related to the number of arithmetic operations you perform, and so whether you optimize the number of operations or the number of memory accesses, you're pretty much doing the same thing; you focus on one or the other. Typically we focus on the number of operations, so in all of the analyses you might have done so far, you tend to count how many adds you're doing, how many multiplies. When you say that an algorithm runs in big O of n², you're saying there are n² operations, and that's a measure of the time.
12:10 Okay, all right. So this is a simplistic model, which was good maybe up to twenty years ago or longer, and it's a model that is easy to handle in terms of doing the analysis. However, the model isn't very accurate in the context of contemporary computers, and it is unable to explain experimentally observed runtimes.

12:37 For example, look at this code here to multiply two matrices A and B to produce a result C: you have your typical three nested for-loops producing the result C. Now, if you permute these three loops, you get six different permutations, and it makes no difference to the result. You can change the order of those three loops to any of the six permutations and there's no change in the answer that is produced, assuming there's no numerical instability; you get the same answer whichever way you do it. And if you do your traditional analysis, there's no difference in the operation count either: the number of multiplies, adds, and everything else is exactly the same. So using the traditional internal computing model, one would be tempted to say that the runtime of this piece of code isn't going to change as you change the ordering of those three for-loops.

13:46 However, if you actually conducted the experiment, you would find that the runtime is different. And not only is it different, but the traditional order you'll find for matrix multiplication in a textbook, the ijk order, is actually very bad. If you try it out on computers like a Sun workstation and invert the two inner loops, you can see a speed-up which could be significant; in some experiments done some time ago the ratio was 2, but today the ratio could be larger than 2. So the question is: why does the observed runtime vary, in this case by a factor of as much as 2, but potentially by larger factors, even a factor of 8, when there is no difference in the number of operations? And that has to do with the model: the model is inaccurate.
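The slide's code isn't reproduced in the transcript, so here is a minimal C++ sketch of the two loop orders being compared, for n-by-n matrices stored as arrays of arrays (the names a, b, c, n are just for illustration; c is assumed zero-initialized):

```cpp
// Classic textbook order: i, then j, then k.
void multiplyIJK(double** a, double** b, double** c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];   // innermost line is hit n^3 times
}

// Same operations, same result: only the two inner loops are swapped.
void multiplyIKJ(double** a, double** b, double** c, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both versions perform exactly n³ multiplies and n³ adds and produce identical results; the traditional model therefore predicts identical runtimes, which is what the experiments contradict.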
14:56 So a more accurate model looks like this. It says that the PC you buy, for example, actually has many levels of internal memory. You've got your arithmetic unit, and sitting very close to it is a bank of registers; the number of registers is typically very small, somewhere between 8 and 32. If you want to add two numbers that are sitting in registers, you can add them in one cycle. But if the numbers you want to add are not sitting in registers, then you have to get them from whatever level of memory they're sitting in. They may be sitting in L1 cache; that's still very fast memory, but not as fast as register memory. Typically you may have about 32 kilobytes of L1 cache, so there's not a whole lot of data you can put in there. If what you're adding is sitting in L1 cache, that data has to be brought from L1 cache to the registers before you can perform the arithmetic. Say it takes two cycles to move a piece of data from L1 to a register; there are two pieces of data, so that could take you four cycles, and then you do the add, so that's five. So if the data you're working on isn't sitting in the registers but is sitting in L1 cache, your computation is suddenly running five times as slow.

16:28 But L1 is pretty small, so like it or not, you're sometimes going to have to go further out to get data. You've got an L2 cache, which may be on the order of a quarter megabyte, in some cases maybe even half a megabyte. If your data isn't sitting in L1 and you want to add two numbers, you've got to bring them from L2 cache to L1 cache to the registers. Let's say it takes ten cycles to move a piece of data from L2 into L1 and the registers; now you've got two pieces of data, that's 20 cycles, plus the add, 21. Now you're running 21 times as slow as you could have.

17:15 Unfortunately, when you start your program, everything starts in main memory; all your data is sitting there, not in L2 or L1. That's the big memory, so that may come as 128 megabytes, 256 megabytes, a gigabyte, several gigabytes. All the good data is sitting in main memory, but that fellow is very slow to get data from: say a hundred cycles. If you have to get both of your pieces of data for the add from there, that's 200 cycles plus one; you're now running 200 times as slow.

17:57 So the point is that in contemporary computers the operations are virtually free; what costs you is the memory access, and that's not something anybody traditionally counted. And there are many different kinds of memory access: are you accessing the registers, L1, L2, or main memory? Depending on where your data is sitting, you can see a huge difference in the runtime while performing exactly the same number of operations. So the challenge in getting good performance is arranging the computation so that the number of times you have to go to each slower level is reduced. The biggest payoff, of course, is in reducing the number of accesses to main memory, because that's the slowest.

18:57 All right, so with that, let's see if we can explain the difference in the matrix multiplication performance. First we need a better model. We've got this multi-level memory model, and for analysis purposes we'll simplify it and say we've got only one level of cache, the L2 cache. So you've got the L2 cache, the main memory, and the arithmetic unit.

19:26 Then we also need to understand how arrays are organized. Suppose you have a two-dimensional array X. It's represented using something called the array-of-arrays representation, which means that you have a one-dimensional array called X, and since I've got three rows, I've got X[0], X[1], X[2]. X[0] is actually a pointer to another one-dimensional array, which is row number 0; the next one is a pointer to another one-dimensional array, which is row 1; and so on. So a two-dimensional array is really represented as a collection of one-dimensional arrays.
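A minimal C++ sketch of this array-of-arrays layout (the sizes and names are illustrative):

```cpp
// Build a rows-by-cols 2D array as an array of row pointers.
// x itself is a 1D array of pointers; each x[i] points to a 1D row.
double** makeMatrix(int rows, int cols) {
    double** x = new double*[rows];   // the spine: one pointer per row
    for (int i = 0; i < rows; ++i)
        x[i] = new double[cols]();    // each row is its own contiguous array
    return x;
}
// x[i][j] first follows the pointer x[i], then indexes into that row, so
// elements within a row are adjacent in memory but distinct rows need not be.
```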
20:18 Also, we need to understand what happens when you service a cache miss. Let's say somebody tries to access X[0][0], and you go to main memory to fetch it. What is transferred to the L2 cache isn't just the value of X[0][0]; you fill a cache line. So if a cache line is, let's say, 32 bytes wide and you're dealing with 8-byte values, then when you service the cache miss for X[0][0], its three neighbors in the cache line all get transferred into L2 cache as well. Just like when you go to the disk you don't get just the one piece of data you want but the whole block that contains it, the same thing happens here: if the data you want isn't in L2 cache, when you go to main memory you bring in the corresponding cache-line-sized chunk.

21:17 Of course, since the size of the L2 cache is very small compared to that of main memory, it's possible that later in the program, when you come to get some other part of this array, it'll end up overwriting the cache line into which you put this piece of data. So the contents of the cache keep getting changed; you end up losing data that you brought in before, because of the difference in size between main memory and your cache.

21:54 All right, so let's analyze this thing. Let's first look at the ijk order. The innermost line is going to get hit n³ times. Now look at the computation of C[0][0]: we first get the value of C[0][0] and increment it by the right-hand side, then on the next round get C[0][0] again and increment it by the new right-hand side, and so on. So even though we're going to access the C values n³ times in total, the n accesses to C[0][0] are done in succession, one after the other. The value of C[0][0] is brought into cache once and then used n times before it disappears. And when you bring in C[0][0] you also bring in its neighbors; if you can accommodate W values per cache line, then before you have to service another cache miss on C, you can go through nW iterations of the inner loop. So the C's are accessed row by row: you use this one n times, then the next one n times, then the next, and so on.

23:40 When it comes to the A's: I first use a[i][k]; then k increases by one, so within a row you go to the next element, and the next, and the next. So the A's are also being accessed by row.
24:05 But in terms of the use pattern, we access one A value and then on the very next iteration we access the next one. When you access the first one, let's say W of them come into the cache; so you had a cache miss there, but you don't get a miss on the next few. Assume W is 4: no miss, no miss, no miss, and then a miss on the fifth. Now, at some later time you come back and use this row again: you multiply a row of A by each column of B, so when you do this row times the next column, you use these values again. But by the time you're ready to reuse the row, the cache may have lost those values; somebody else may have overwritten them, because that reuse comes much later in time. So for every W accesses to A you get one cache miss, and when you come back to reuse the row, you miss again. Where C had an nW reuse factor, A only has a reuse factor of W.

25:15 Then you go to the B's: I first use b[k][j], but on the next iteration k changes, so you go down the column. The first time, when you want this fellow, you get a cache miss and bring in W elements of its row; then you want the fellow below it, another cache miss, another W elements; then another miss; and so on. Unfortunately, by the time you're ready to use the rest of that first cache line, the line may have been overwritten. So you're getting a cache miss every single time you want a B value: the reuse factor is only one. So C has a very good reuse factor, nW; A, not so good, W; B, a horrible one, 1.

26:12 All right, so the number of cache misses: there are n³ uses of C, but only every nW of them causes a cache miss, so that's n³/(nW) = n²/W cache misses for C. There are n³ accesses of A with a reuse factor of W, so that's n³/W misses. For B, again n³ accesses, but every access is a cache miss, so that's n³. Any questions about this? All right, so if I add up my cache misses, I end up with n²/W + n³/W + n³ misses for the ijk order.
27:06 Now let's do the other order, the one I said works best: ikj. Let's look at C. In here, the j index is changing very rapidly; each time you come to the innermost line, j increases by one, so we're accessing C by row. But you use an element once, then on the next iteration you use the next one, and the next; by the time you're ready to reuse this row, you may have lost it. So this is like the A of the previous order: the reuse factor is W.

27:52 Now look at the A. The i is the outermost index, and then you have the k value. So if I hit the innermost line n times, I'm still accessing the same A value, a[i][k]; once I'm done hitting it n times, the value of k changes and I get the next one, which I then use n times, and so on. That's like what was happening with C before: you're still accessing A by row, but only every nW uses cause a cache miss, so the reuse factor is nW.

28:39 For the B, that's b[k][j]: every time you come to this line, j changes, and j is B's column index, so you're moving along a row; you're now accessing the B's by row. Each time you come here you get the next element of the row, so you get a reuse factor of W; by the time you come back to reuse an old one, it could have been overwritten.

29:05 So now you have reuse factors of W, nW, and W; previously it was nW, W, and 1. We've improved the reuse factor of one of the arrays without reducing the reuse factors of the others.

29:26 All right: C had a reuse factor of W, so you get n³/W misses; A had a reuse factor of nW, so n²/W; and B had a reuse factor of W, so n³/W. If you add these up, you end up with n³/W + n²/W + n³/W. So the important thing is that in the traditional ijk scheme the reuse factors were nW, W, and 1, and now they are W, nW, and W, and that accounts for the difference in performance that people observe.

30:08 All right, so we arrived at these two equations. We can take the ratio of cache misses with ijk versus ikj; assuming n is large, it works out to about (1 + W)/2. So if you're dealing with a 32-byte cache line and 8 bytes per data item, you have 4 pieces of data coming in per line, W is 4, in which case you can expect a performance ratio of about 2.5 on a large matrix. If you have a 64-byte cache line, same 8 bytes per data item, then W becomes 8 and you can expect a ratio of 4.5. If you're dealing with integer data, four bytes, then W is 16 and you get a ratio of 8.5. So being aware that computers have caches, and reorganizing the computation to optimize the use of the cache, can improve the performance of your code quite a bit.
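Putting the two miss counts side by side and keeping only the dominant terms for large n, the quoted ratio falls out directly:

\[
\frac{\text{misses}_{ijk}}{\text{misses}_{ikj}}
  = \frac{n^2/W + n^3/W + n^3}{n^3/W + n^2/W + n^3/W}
  \;\approx\; \frac{n^3\,(1 + 1/W)}{2\,n^3/W}
  = \frac{W + 1}{2}.
\]

For W = 4 this gives 2.5, for W = 8 it gives 4.5, and for W = 16 it gives 8.5, matching the ratios above.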
31:22 And so, using some of these ideas from external sorting back in internal sorting can help us with internal sorting schemes, because basically, as we'll see, the ideas in external sorting are there to reduce the number of times you go to disk; in the case of internal sorting, you want to reduce the number of times you go to main memory. So we can get faster internal sorting by using some of the ideas we're going to talk about. Experimentally, people know that if you recode merge sort to use something called a tiled merge sort, at the high level you're doing exactly the same thing, but you reorganize the code and you can run two times as fast as the textbook merge sort. Similarly, there are schemes known for heap sort where, instead of working with a binary tree of degree two, you work with a tree of degree four, and you can get a speed-up using essentially the same heap sort idea. So even though we're going to have our discussion in the context of external sorting, you shouldn't go back thinking that these ideas apply only to the external domain; they apply also to the internal domain.

32:45 All right, so let's now turn our attention to external sorting methods. If I came to you and said I need you to write code to do an external sort, being experts in internal sorting you'd probably say: let me take a look at the best internal sorting scheme I have and see if I can adapt it to the new environment. That seems to be the logical way to solve a problem related to something you already know how to solve. Now, in the internal sorting world we know that, depending on whether you're looking for average-case or worst-case performance, the choice of method is different: if you're trying to optimize the average runtime, you'll use quicksort; if you're trying to optimize the worst-case runtime, you'll use merge sort. So depending on your interest, you might want to adapt one or the other, and we'll take a look at adapting both: today we'll see only how to adapt quicksort, and in the next lecture we'll see how to adapt merge sort.

34:00 All right, so let's take a quick overview of an internal quicksort and see how it works. Basically, I can describe it recursively. You have a collection of records to sort; I'm showing only the keys. The first thing you do is pick a pivot using some pivot-picking strategy; in this case I'm picking the leftmost fellow, six, as the pivot. Then you partition everything based on the pivot, so that the pivot ends up somewhere in the middle: everybody less than or equal to the pivot is on the left side, everybody greater than or equal to it is on the right side. Then I sort the left part recursively and the right part recursively, and basically that's all there is to quicksort. In quicksort this middle group is of size one; in our adaptation to an external sort, the middle group will be large. We'll see how that comes about.
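As a reminder of the internal scheme being adapted, here is a minimal recursive quicksort sketch in C++, using the leftmost element as the pivot as in the lecture's example (a real implementation would pick pivots more carefully):

```cpp
#include <vector>

// Sort a[lo..hi] recursively; the pivot ends up in its final position,
// forming a "middle group" of size one between the left and right parts.
void quickSort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo];                      // leftmost element as the pivot
    int i = lo, j = hi;
    while (i < j) {
        while (i < j && a[j] >= pivot) --j; // scan from the right for a small key
        a[i] = a[j];
        while (i < j && a[i] <= pivot) ++i; // scan from the left for a large key
        a[j] = a[i];
    }
    a[i] = pivot;                           // pivot placed at its final index i
    quickSort(a, lo, i - 1);                // recursively sort the left group
    quickSort(a, i + 1, hi);                // recursively sort the right group
}
```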
35:21 All right. So I take the memory that's available to me and break it up into components for different uses. My strategy is: you've got all of this data sitting on a disk, and somehow I have to break it up into three groups: a middle group, a group of small elements, that's the left group, and a group of big elements, that's the right group. Everybody in the left group is less than or equal to everybody in the middle group, and everybody in the right group is greater than or equal to everybody in the middle group. Then I recursively sort the left and right groups, assuming the middle group is already sorted. That's the general strategy, and it's pretty similar to the internal quicksort except that the middle group is now of size more than one.

36:13 In order to accomplish this, I have to keep in mind that my data is not in memory; it's sitting out on a disk. So in this simple strategy I'll take the memory I have and break part of it up into what I'll call three buffers. These are input or output buffers; the size of a buffer will be the same as the size of a block on disk. To do any work I've got to bring the data into memory, so when I read, I'm going to need an input buffer: I bring in a block of data and put it into memory, so I need something big enough to hold a block. When I want to write something to disk, I'm going to need a place in memory where I accumulate the stuff so I can write a whole block out; I don't want to be writing out one record at a time, because that's too many disk accesses, so I'll be working in units of a block.

37:13 So in this strategy I use an input buffer, one block long, for reading from the disk one block at a time. Then I collect the left part, the stuff that is smaller than the middle group, in memory one block at a time, in a buffer we'll call the small buffer; that's also one block long. Then I have a buffer called the large buffer, where I collect the right group, one block at a time.

38:00 So the idea is: whenever you need to read data, you read it into the input buffer and process it. Records identified as part of the left group are collected in the small buffer; those identified as part of the right group are collected in the large buffer. When the input buffer becomes empty, I read in another block from disk. When the small buffer becomes full, I write it to disk and can start accumulating more of the left group. When the large buffer becomes full, I write it to disk and can begin accumulating more of the right group, the big group. The remainder of the memory is used to hold the middle group.

38:51 For this scheme to work, the assumption is that the amount of memory you have is more than three blocks. That's a pretty reasonable assumption, because block sizes on disk may be on the order of a few kilobytes, whereas the amount of memory you have is on the order of hundreds of megabytes or gigabytes.

39:17 All right, so I've got this chunk of memory for the middle group. As far as the middle group goes, it has to satisfy the property that everybody you output through the small buffer is less than or equal to everybody in the middle group, and everybody you output through the large buffer is greater than or equal to everybody in the middle group.

39:45 [Question from the audience.] Well, at this point the only requirement is that the records of the middle group should be less than or equal to those in the large group and greater than or equal to those in the small group. Once we are done: as I described, the way the process works, we output these two side groups block by block, but when you have finished processing all of the records on the disk, you'll be left with a whole chunk of records in the middle group, and then you have to write those out. To write those out you'll need to sort them first, so you can write them out as a sorted middle group.

40:28 That's correct, there will be three partitions. The sizes aren't known ahead of time; they depend a lot on the order of the data, just like in quicksort. Three partitions are created: small, middle, and large.

40:59 [Question: won't other disk users delay things?] Well, of course they will delay things, and the real question is whether you can do anything about it. If you're in a multiprocessing system, can you control the other users? No, you can't; but do you have an alternative to performing the task? That's really the issue.
41:17 If you're on a single-user system, then you can get into schemes to try and optimize reading from the disk and writing to the disk; but in a multi-user system, where lots of people are sharing the disk, you can't control the position of the disk head.

42:13 [Question about balancing the partitions.] There is no efficient way known to partition into equal-sized groups. Whenever you partition, you pick a pivot: you either pick a random pivot, or take a median of three, or use more sophisticated rules like median of seven, median of nine, or whatever, but they don't guarantee equal-sized partitions. And second, setting aside how you choose the pivot, that kind of balancing strategy is based on an internal scheme, which means the whole data set is sitting in memory, and that isn't practical here.

42:52 Okay, all right. So you've got your disk out there. As I explained, to initialize the system you just read as many blocks from disk as needed to fill up your middle group, whatever memory you've got there, and then you read in one more block to fill up the input buffer. That's the initial configuration, and then you start processing. You look at the next record in the input buffer. If it's less than or equal to everybody in the middle group, you put it into the small buffer. If it's greater than or equal to everybody in the middle group, you put it into the large buffer. If it lies in between, then you can't put it in either buffer; you have to either take the smallest element of the middle group and move it to the small buffer, or take the largest element and move it to the large buffer, and then the new element can be pushed into the middle group. So there are three cases, and these three strategies cover all of them.

44:09 While you're processing, these three steps really run in a loop; this is basically what you'll be doing throughout. In the process, the input buffer may become empty, so you've got to stop to refill it; the small buffer may become full, so you've got to stop to output it; the large buffer may become full, so you've got to stop to output it; then you continue. Once you're done, whatever is sitting in the middle group can be sorted and written out using an internal sort, and then you recursively sort the small group and the large group.

44:54 So the issue that arises now is: what data structure do you use to implement the middle group? Consider the operations you need to perform on it. When you take a record from the input buffer: if it's less than or equal to the smallest element in the middle group, it goes out to the small buffer; that's easy, you just put it at the end of that buffer. If it's greater than or equal to everybody in the middle group, you put it at the end of the large buffer. Otherwise the item has to be inserted into the middle group, but to insert it you've got to take somebody out, because the middle group is full.

45:59 So you may want to take the minimum item out, or you may want to take the maximum item out. In order to get roughly equal small and large groups, you would probably use some heuristic which says: if the small group is currently bigger than the large group, take out the max element and put it in the large buffer; if the small group is currently smaller than the large group, take out the min element and put it in the small buffer, in an attempt to balance the sizes of the two groups.

46:25 So I need to be able to remove sometimes the min and sometimes the max, and once I've done the remove, I need to do an insert. So you need to support what's called a double-ended priority queue: you need to be able to insert, to remove the min, and to remove the max. To implement the middle group, I need a double-ended priority queue.
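Here is a minimal C++ sketch of that processing loop. For illustration, a std::multiset stands in for the double-ended priority queue (it supports insert, delete-min, and delete-max, though without a real DEPQ's constant factors), and the refill/flush routines are hypothetical placeholders for the block I/O just described.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical block-I/O helpers for the three one-block buffers.
bool refillInput(std::vector<int>& in);    // read next block; false at end of input
void flushSmall(std::vector<int>& small);  // write the small buffer's block to disk
void flushLarge(std::vector<int>& large);  // write the large buffer's block to disk

void partitionPass(std::multiset<int>& middle,  // middle group, pre-filled from disk
                   std::size_t blockSize) {
    std::vector<int> in, small, large;
    long smallCount = 0, largeCount = 0;        // totals, for the balancing heuristic

    while (refillInput(in)) {
        for (int key : in) {
            if (key <= *middle.begin()) {           // case 1: left (small) group
                small.push_back(key); ++smallCount;
            } else if (key >= *middle.rbegin()) {   // case 2: right (large) group
                large.push_back(key); ++largeCount;
            } else {                                // case 3: displace min or max, then insert
                if (smallCount < largeCount) {      // small group lagging: evict the min
                    small.push_back(*middle.begin());
                    middle.erase(middle.begin()); ++smallCount;
                } else {                            // otherwise: evict the max
                    large.push_back(*middle.rbegin());
                    middle.erase(std::prev(middle.end())); ++largeCount;
                }
                middle.insert(key);
            }
            if (small.size() == blockSize) { flushSmall(small); small.clear(); }
            if (large.size() == blockSize) { flushLarge(large); large.clear(); }
        }
    }
    // Flush partial blocks; the caller then sorts and writes the middle group
    // and recursively processes the small and large partitions on disk.
    if (!small.empty()) flushSmall(small);
    if (!large.empty()) flushLarge(large);
}
```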
46:49 So that's one application for double-ended priority queues. Another possible application for priority queues is a queue in a network. For example, you have a finite-size buffer and packets coming in; when a packet comes in, it gets buffered into the queue, waiting for the output line to become idle. When the output line becomes idle, you take the highest-priority packet out of the buffer and put it onto the line, so you want a delete-max. When a packet comes in, you want an insert; but when a packet arrives and the buffer is full, you probably want to drop the lowest-priority packet, so you want a delete-min. So you could implement a network buffer as a double-ended priority queue.

47:46 Now, in the sorting application, we could further improve the performance of this external quicksort. While we're doing all of what we said, if we had a second input buffer, I could simultaneously be reading a block instead of idling on my I/O channel, and then switch between the buffers. Same thing on output: I could have two small buffers, so while one is being written, you continue filling the other; I could have two large buffers, so while one is being written, you continue on the other; you don't have to stop. So you get this idea of having more buffers than is strictly essential, so as to improve performance. That issue is more interesting when we look at merge sort; we'll do that later.

48:38 All right, so I guess that's kind of a summary of some of the ideas behind external sorting, especially behind the adaptation of quicksort, and also why what we're going to be doing applies not just to external sorting but also to internal sorting, and to internal computing in general, in trying to optimize cache utilization. Any questions? Okay, we meet Wednesday.