Transcript


00:10
okay yeah are there any questions before
00:13
we start today no questions okay all
00:19
right so we're going to look at a new
00:23
problem which I guess in some sense is
00:26
related to problems you looked at before
00:28
this is a multi-dimensional range search
00:31
problem okay so when we were looking at
00:34
priority search trees we were handling
00:36
data two-dimensional data points and
00:39
with XY coordinates now I guess we look
00:43
at higher dimensions but the structures
00:45
will also be applicable to collections
00:48
of points in two dimensions the priority
00:54
search trees even though we were able to
00:57
handle points in two dimensions there
01:00
were requirements there that all of the
01:02
points should have a y value that's
01:03
written equal to zero and then all the
01:08
rectangles kind of made that assumption
01:10
that the bottom that the bottom of the
01:14
query was always y equals zero so you
01:16
really couldn't do a quarry efficiently
01:19
for a rectangle that was sitting
01:21
somewhere in arbitrary two-dimensional
01:24
space with a nonzero y value all right
01:27
so here
01:32
these structures we're going to look at
01:34
today for the static case so you're a
01:36
collection of points in
01:38
multi-dimensional space does not change
01:40
in time you know that collection ahead
01:43
of time and then you build a data
01:44
structure to perform curries in an
01:46
efficient manner all right so no changes
01:50
to the collection of points the only
01:53
operation being performed is one of
01:55
query of course there's the operation of
01:57
construct construct the data structure
01:59
initially the data we're dealing with is
02:08
k dimensional data so you have K code
02:13
and the curries are correspondingly
02:15
k-dimensional curries so for example
02:19
we'd be given case sets of ranges L
02:23
being a lower bound for the range you an
02:25
upper bound okay and then we need to
02:29
report all points or data in our
02:33
collection which fall within the given
02:36
range so in each of the K feels your
02:39
value lies between Li and UI so in the
02:45
case of two dimensions this would
02:47
correspond to being provided a rectangle
02:49
right k dimensional quarries show up
02:57
frequently in database type applications
03:00
so for example we may want to report all
03:04
employees in our collection with the
03:06
property that their age lies between 30
03:08
and 40 okay and the salary lies between
03:11
40,000 and 70,000 okay so you have a
03:14
two-dimensional query l1 is 30 u 1 is 40
03:19
l 2 is 40 L u 2 is 70 you'd go to higher
03:27
dimensional curries so in a geographic
03:30
database and report all the cities in
03:33
your collection that have rainfall
03:35
between 40 and 60 inches so l1 is 40 and
03:38
u 1 is 60 and where the population lies
03:43
between hundred thousand and two hundred
03:44
thousand so l2 is hundred thousand l2 is
03:47
two hundred thousand the average
03:50
temperature in the cities got to be at
03:51
least 70 degrees Fahrenheit so l3 is 70
03:55
and u3 is infinite and the number of
04:00
horses in the cities between thousand
04:02
twenty five and twenty five hundred so
04:04
you get a four dimensional query here
04:07
alright so when you're dealing with
04:09
collections of records whether they be
04:12
employee records or records that
04:15
describe property of cities we need to
04:17
be able to handle queries in multiple
04:20
dimensions and the number of dimensions
04:22
could get fairly large
04:26
well there's several data structures
04:29
that one can think of some of them a
04:33
simple straightforward others more
04:36
involved for solving this problem okay
04:40
so the simplest thing would be maybe say
04:43
no data structure at all you just take
04:46
your collection of records store them in
04:48
some arbitrary unordered fashion you get
04:50
a curry you just examine every single
04:52
record see which one satisfy the query
04:55
and report those alright next in
05:01
simplicity may be something called
05:03
sorted tables so you sort your
05:09
collection of Records on each of the K
05:12
keys of each of the K fields and you
05:18
build one table whereas table is just a
05:20
sorted collection of records for each of
05:22
those K Keys so you end up with K of
05:25
these tables and then when you get a
05:29
curry you examine each of these tables
05:34
so you look at table one and find out
05:37
how many records Y satisfy the l1 you
05:42
one range look at table to find out how
05:46
many satisfy the o2u to range look at
05:49
table 3 find out how many live in the l3
05:51
you three arrange and so on then you
05:54
take the smallest of these and examine
05:58
from that table more carefully to see if
06:01
they satisfy the remaining requirements
06:07
so if you have two dimensions okay then
06:12
maybe you first look at Table one and
06:14
find that there are 100 records that lie
06:17
between l1 and u1 look at Table two you
06:20
may find that there are only 20 records
06:23
that lie between l2 and YouTube so I'll
06:27
use the table to results take a look at
06:29
those 20 records and verify which ones
06:32
lie between l1 and u1 and report those
06:34
as the result
06:37
but then there's a structure called
06:39
cells I'm going to describe that here
06:41
you can take a look at that from the
06:43
reference something called KD trees
06:47
we're going to look at that structure in
06:49
detail today and something called range
06:54
trees we'll take a look at that
06:56
structure also in detail today
06:59
the other more involved structures Capel
07:02
trees we're not going to have time to go
07:04
through those in this class and then
07:09
there are K ranges we won't have time to
07:11
look at those either so we're going to
07:16
focus just on these two here today all
07:21
right so let's start by taking a look at
07:22
KD trees maybe before we look at that
07:28
let's put up the metrics we're going to
07:29
use to evaluate the structures so we're
07:33
going to be concerned with the amount of
07:35
time it takes to build the data
07:38
structure so that's the pre-processing
07:40
time P time required to build the data
07:43
structure to pre-process the data when
07:46
you have n records and K keys okay is
07:50
the number of dimensions for most
07:57
applications this is not a very
07:58
sensitive quantity you pre-process the
08:02
data once and then you search it many
08:05
many times and the only requirement
08:07
really is that the pre-processing time
08:10
be reasonable what does reasonable mean
08:13
in some contexts reasonable might mean
08:16
that an hour is enough in other contexts
08:19
a month may be enough just depends on on
08:25
how much you hope to gain by putting in
08:27
the extra pre-processing time so we may
08:33
not be terribly concerned with the
08:35
amount of time spent pre-processing
08:38
except when it becomes intuitively very
08:42
unreasonable and if it takes a hundred
08:43
years it's probably not a reasonable
08:46
amount of time
08:47
all right we be concerned with the
08:50
amount of space the structure takes so
08:53
the structure that's built as a result
08:55
of this pre-processing how much space
08:57
does it take and again the concern about
09:03
space would be very dependent upon the
09:08
application you have if you're going to
09:10
take the structure and store it on a on
09:14
a DVD and then sell these DVDs there's
09:19
no advantage to using say less than two
09:23
gigabytes of space if that's the
09:25
capacity of your DVD so getting the
09:28
space down to below the capacity of the
09:30
physical media may have zero value so
09:37
again depending upon how you intend to
09:39
use the structure exactly what value or
09:43
importance you want to place on space
09:45
would vary and then there's the Curie
09:50
time how much time does it take to
09:53
answer a query and that's really the
09:56
going to be the most important criteria
09:58
here because you're going to be
10:00
answering queries many many times you're
10:02
going to build it once space or long as
10:05
you can fit it in the media on which you
10:07
plan to run the queries that's probably
10:10
adequate all right so let's take a look
10:16
at the first at the first of the two
10:18
structures we want to study today this
10:20
one is the KD tree the KD tree is a
10:25
binary tree and basically at each node
10:33
of this binary tree you partition your
10:36
set of points into two roughly equal
10:39
groups for this petitioning you pick a
10:44
coordinate on which to cut the points
10:46
and once you pick a coordinate okay so
10:52
first you pick a coordinate and there
10:55
are different strategies to pick
10:56
coordinates but generally one uses this
10:59
heuristic here which says
11:01
pick a coordinate in which the values of
11:04
your points is has the maximum spread so
11:09
if you're looking ATS they're all of the
11:10
first coordinates you'll find the
11:12
smallest first coordinate the largest
11:14
was going to take the difference
11:16
there'll be the spread in that
11:17
coordinate so so pick one where you got
11:20
a good large spread once you've picked
11:24
the coordinate then for the actual point
11:26
on which to do the splitting of this set
11:29
of points you take the median value in
11:34
that coordinate I will see an example to
11:39
make that more clear okay so the node of
11:43
the binary tree will actually store the
11:46
coordinate on which you do the cut as
11:47
well as the value of the coordinate that
11:49
you use then all of the K dimensional
11:56
points in your collection which on that
11:59
particular coordinate have a value less
12:02
than equal to M will go into the left
12:04
subtree and those with the value bigger
12:07
than m will go in the right subtree so
12:11
the expectation is since you're picking
12:13
the median but half the points will show
12:14
up on this side and half will show up on
12:16
that side okay you end up with a slight
12:20
problem when there are a large number of
12:21
points that are equal to the median then
12:23
you don't get it good half split
12:34
you stop this petitioning process when
12:37
the number of K dimensional points in a
12:41
partition becomes small enough so
12:44
typically you don't go all the way down
12:45
to one point you have some threshold the
12:48
threshold I guess what I've got on the
12:51
slide is say 8 or 16 points again the
13:02
the the threshold in a real situation
13:07
may be determined by physical
13:09
characteristics of the media where
13:11
you're storing this thing the structure
13:13
so if you're storing it on a hard disk
13:16
then the retrieval unit from your disk
13:19
would be a block and so once you get a
13:22
partition down to a size of a block you
13:25
just leave it at that bring in the whole
13:26
block and search it using some in some
13:30
alternative structure okay all right so
13:34
we'll stop when the partition size
13:36
becomes small let's take a look at an
13:41
example and the example would be a two
13:45
dimensional case so I've got a
13:48
collection of points which are thrown
13:50
out in two-dimensional space or to
13:53
organize these into a 2d tree so the 2d
13:58
tree is going to be a binary tree in
14:01
fact doesn't matter whether you have two
14:03
dimensional or twenty dimensional points
14:04
it's always a binary tree okay for the
14:09
route I need to pick a dimension along
14:11
which to partition these points into two
14:14
groups okay depict the dimension I look
14:18
at the spread in X direction as well as
14:25
the spread in the y direction so the
14:27
spread in the X direction this point has
14:29
the lowest x value in this fellow I
14:32
guess up there has the highest x value
14:34
so the spread is you take the difference
14:36
in the two X values look at the spread
14:38
in the y direction so this one's got the
14:41
lowest Y and somebody up there's got the
14:43
highest Y look at that so here the
14:46
spread is more in the X direction so
14:48
pick the I choose to cut in the
14:51
x-direction alright then I need to
14:54
select a value of X to cut so it picked
14:57
the median of the X values of this
14:59
collection of points okay and that turns
15:03
out to be this x value here so points
15:09
that are less that have an x value less
15:11
than equal to this cut that's these
15:13
including this one on the line and the
15:16
cut line will show up on the left
15:17
sub-tree and the remaining points those
15:20
to the right of the cut line will show
15:22
up in the right subtree
15:25
alright so the route will contain
15:30
information about the coordinate that
15:32
you picked and the value of that
15:34
coordinate so even though my label here
15:37
just says a is the cut line in reality
15:40
what you distort here is the cutters on
15:42
X and the value of x is whatever
15:44
corresponds to this point okay all right
15:48
so in the left sub-tree I'm going to
15:50
have these three four five six seven
15:52
eight nine points and then the remaining
15:54
two four six eight nine points in the
15:57
right subtree alright so now for the
16:03
left subtree I've got these nine points
16:06
I've got to divide those into two groups
16:13
for this example I will assume that the
16:16
threshold at which I stop dividing is
16:18
three okay so since the number of points
16:22
is more than my threshold I got it
16:24
divided to divide I got to pick a
16:26
coordinate I look at the X spread so
16:29
expert is from here to over there
16:31
the Y spread is from here to over that
16:35
okay so the X spread is bigger I'm going
16:37
to divide on X and I pick a median X
16:41
okay so I end up picking this value of X
16:45
with this these one two three four five
16:48
points will be in the left subtree and
16:50
then these one two three four will be in
16:53
the right subtree
16:59
okay all right for the other side the
17:07
number of points exceeds my threshold of
17:09
three so again I got to pick a
17:13
coordinate to cut on I look at the
17:16
expert and the widespread the expert is
17:18
more than the why so then I got to pick
17:20
a median x-value to do the cut and it
17:24
turns out to be that one over there see
17:26
and so that'll give me these one two
17:28
three four five points in the left
17:30
subtree of c and these four and the
17:33
right subtree of C all right so the left
17:45
subtree of B has these five points and
17:48
since that's bigger than the threshold
17:51
I've got to split the left subtree of B
17:54
to split the left subtree I look at the
17:56
X pride so that's from here to there and
17:58
the widespread is from here to here so I
18:01
determine which one is bigger and then
18:04
in this case it's the x1 again and then
18:07
I pick a median X which turns out to be
18:10
D these three points end up in the left
18:12
subtree and these two end up in the
18:15
right subtree okay all right so I do
18:18
another cut of the left subtree of B
18:21
that's down the long D the right subtree
18:25
of B has one two three four points so
18:28
you got to do a cut the X spread is now
18:31
from here to there the wide spread is
18:33
from here to there
18:34
the wide spread is bigger so I do a cut
18:37
on why I picked the median Y which turns
18:40
out to be that fellow so these two
18:42
points are in the left subtree and those
18:45
two in the right subtree okay so you cut
18:48
on D all right so we cut at E now if you
18:56
go back a moment okay so the left
18:58
subtree of d as these three points it's
19:00
not going to be partitioned any more
19:02
because the threshold were using is
19:03
three the right subtree has these two
19:06
points that will not get partitioned
19:07
same thing here for
19:10
the left subtree has two points and the
19:13
right subtree has two points so those
19:16
will not be partitioned further and so
19:19
going into c okay the left subtree of c
19:23
is got one two three four five points we
19:26
have your partition I look at the x
19:28
spread
19:29
that's from I guess here to over there
19:32
and the wide spread is from here to
19:34
there
19:34
okay so the Y spread is more I picked
19:38
the median Y value which is f so these
19:43
three points in the left subtree and
19:45
those two in the right subtree going to
19:52
this side right subtree of see i've got
19:54
these four points smaller than the
19:56
threshold so you're going to split the x
19:59
spread is from here to there and the y
20:02
spread is also determined by those two
20:04
points okay so in this case the x spread
20:09
this distance is more than that distance
20:11
I pick an x value the median X is G so I
20:15
get these two to be in the left subtree
20:17
of C and the other two in the right
20:19
subtree of C okay so in this case none
20:22
of those four nodes that we have are
20:25
going to get split any further instead
20:28
they'll just point to buckets which
20:31
contain the respective points okay so
20:34
the left subtree of d has these three
20:37
points it's a bucket which has those
20:38
three points the right subtree of d is a
20:41
bucket that has these two points
20:58
all right so because you used the median
21:04
value to do the splits we expect that
21:07
the number of levels in this tree will
21:09
be log of n at each level the size of a
21:13
partition becomes roughly half all right
21:21
so in a KD tree you have a binary tree
21:25
which is used to petition all of your
21:27
points into a collection of buckets the
21:33
points themselves are stored in a bucket
21:35
each point is stored in exactly one
21:37
bucket let's take a look at how you
21:47
might perform a search so to perform a
21:50
search you are provided a curried
21:53
rectangle in the two-dimensional case so
21:56
I'm given a l1 value a u one value and
22:00
l2 value and a u2 value unlike priority
22:05
search trees where the l2 value has to
22:08
be 0 here the l2 value could be anything
22:14
right so with this query rectangle I'm
22:21
required to report all the points in
22:23
there and that's going to be this blue
22:25
point and that blue point here okay all
22:29
right this yellow line here is just to
22:31
the right of my country ok so to find
22:37
all the points that line the rectangle I
22:39
really need to find out which of these
22:42
partitions that I have overlap with this
22:45
rectangle and then search those
22:47
petitions each of these partitions
22:49
corresponds to a blue bucket okay so
22:53
this partition here doesn't overlap with
22:57
this ok again it's not evident from this
22:59
figure but this yellow line terminates
23:01
just a bit to the right of this cut okay
23:04
so this petition has no overlap so I
23:08
don't really need to search that bucket
23:11
this tall petition overlaps I go to
23:13
search that bucket this partition here
23:15
overlaps I go to search this bucket and
23:17
none of the others to perform the search
23:24
you start at the root okay
23:29
and you determine whether your Curie
23:32
rectangle in this case is entirely to
23:35
the left of this or entirely to the
23:41
right again when I say to the left would
23:42
mean if it's less than equal to it's
23:44
totally to the left otherwise is totally
23:47
to the right all right so so in this
23:55
particular case my query rectangle is
23:58
totally to the left of this so that
24:00
tells me there are no points in the
24:02
right subtree of a so I can cut that off
24:04
I don't need to search that then I move
24:07
here and I look at B and I see that
24:11
there is an overlap you could find
24:13
points in the left subtree of B as well
24:15
as the right subtree of B so I go to
24:17
search both the left and right subtrees
24:18
okay then when you come at D we see that
24:24
there's no possibility of points on the
24:25
left subtree of d because there's no
24:27
overlap so we don't need to go into that
24:29
subtree and if you're looking out here
24:34
in to be I not have to compare whether
24:37
you're above or below this cut line e
24:40
okay so from that node I know that this
24:44
is a horizontal cut line okay so I check
24:48
whether my rectangle is above or below
24:51
or if it straddles if it straddles you
24:53
got to do both sub trees okay so here
24:56
I'm below so I only have to do the left
24:58
subtree so a chop off the right all
25:04
right so we use the tree structure to
25:07
identify the buckets that need to be
25:09
examined closely and then in this case
25:13
I'll end up with these two buckets I
25:15
look at those two buckets and report the
25:17
points in there which actually lie
25:20
within the rectangle okay they may be
25:22
points in these
25:24
two buckets that do not line the
25:25
rectangle and in this particular case
25:28
yeah there are so this fellow here okay
25:32
and that fellow are not in the rectangle
25:36
even though they are in this particular
25:38
bucket all right so that's how the
25:42
structure works
25:57
or any questions about the structure
25:59
before we analyze its complexity yeah
26:16
the the threshold usually is not a
26:18
function of the initial number of points
26:20
it's really based on on looking at a
26:27
payoff between going down the tree
26:30
versus just searching a bucket directly
26:32
okay so the you have a cost associated
26:36
with moving down the tree because you
26:38
need to check whether you have a
26:39
rectangle overlap okay and depending
26:45
upon where this structure is sitting you
26:48
have a cost associated with accessing
26:49
this node from that media okay so at
26:53
some point it's cheaper to just take a
26:55
look at say 10 points or 20 points and
26:58
go through them than it is to go through
27:00
a binary search tree and then eventually
27:03
look at two points site so that's
27:07
independent of how many points you start
27:08
with it's more a characteristic of the
27:11
media you're dealing with okay all right
27:17
other questions
27:21
all right so let's look at the
27:23
performance all right first you create
27:27
the structure so to create the structure
27:30
takes n log n time and let's see why
27:34
that is so okay
27:36
all right since we're petitioning by
27:38
medians we expect to have log n levels n
27:41
is the number of records in your
27:44
collection at each level okay
27:50
so for example here I need to pick a
27:54
dimension to cut on so for that I look
27:57
at the spread so you got to find the
27:59
smallest X and largest X take the
28:02
difference smallest Y largest why take
28:04
the difference okay so you can do all
28:06
that in order n time well it's really K
28:15
times n okay but if you assume that K is
28:19
not part of the complexity measure then
28:21
it's just n okay
28:23
so once you fix the number of dimensions
28:25
we just the exam okay
28:27
so strictly as K times n okay but if you
28:32
zoom K is not a parameter here then over
28:34
here all right so you can find the
28:39
dimension to cut in order n time once
28:42
you've found the dimension you got to
28:44
find the median you can find a median of
28:46
n items in order n time okay so in order
28:49
n time you can determine this value okay
28:53
now when you come down here there are n
28:56
over 2 here so it takes you order and
28:58
over to to find the stuff here
29:00
similarly order in over to there so
29:02
total is again n same thing here n over
29:07
4 n over 4 and over 4 and over 4 total
29:09
of n so at each level you spend order n
29:13
time to find the dimension to cut on as
29:16
well as the medians and then to actually
29:20
do the partitioning of the points is
29:21
linear in the number of points you're
29:23
petitioning so you've got n over 2 you
29:25
can create the n over 4 for here and the
29:27
arrow for here in n over 2 time okay so
29:31
it takes you n time for a level to do
29:33
the petition
29:36
all right so total of log n levels at
29:39
each level you spend order n time to do
29:42
the work the works out a total of n log
29:43
n so there are other ways to arrive at n
29:47
log n you could say over there you could
29:53
just sort by if you're looking at say 2d
30:00
case sort by X sort by Y have two sorted
30:03
lists okay and then to find the spreader
30:06
just look at the endpoints of the two
30:08
lists compute the difference so now
30:10
instead of K and it's okay okay
30:13
the median is simple you just pick the
30:15
middle element from whichever list you
30:18
decide to cut on and then you split
30:21
those lists by marching down them left
30:23
to right you're retaining the sorted
30:25
order so again you saw it only once you
30:29
spend the N log N or if it's K
30:32
dimensions K n log n time to do the sort
30:35
on each dimension once and then you the
30:39
splitting is linear time as you go down
30:41
so it's still order n log yeah alright
30:46
so pre-processing takes n log n the
30:52
amount of space that you need is order
30:55
then the total number of buckets is
30:58
order of n if you view these as external
31:04
nodes to the tree the number of internal
31:06
nodes is one less than the number of
31:07
external nodes so the number of internal
31:10
nodes is n minus one at most so total
31:14
number of nodes in the system is n see a
31:15
total space is order thin
31:22
okay again this kind of a crude count
31:25
for your records are bigger than these
31:27
so maybe you want to say you need space
31:30
for n records plus n minus 1 internal
31:33
notes
31:42
alright the performance analysis is
31:44
somewhat difficult to perform and we're
31:48
not going to do that here I'm just going
31:49
to give the results the the worst case
31:56
query time looks something like this K
31:59
is a number of dimensions so n raise to
32:01
1 minus 1 over K plus the number of
32:03
points in the answer ok if you look at
32:07
this term here as K becomes large okay
32:14
so at K equals 2 for example this is
32:17
square root of 2 but at K equals Forte's
32:19
end to the 0.75 escape becomes large
32:22
this becomes close to n so that's like
32:27
you're looking at all the records okay
32:32
so worst case performance isn't good but
32:37
when you're curries are cubical which
32:40
means you're in the case of
32:42
two-dimensional looking for square type
32:44
boxes all the points within a square
32:47
roughly then you get good behavior on
32:53
average you look like log n plus s when
32:58
the number of points in the answer is
33:00
small and when the number of points in
33:03
the answer is large the log n kind of
33:05
gets to offed by the S so you just end
33:07
up with all of yes all right so the
33:11
structure is expected to have good
33:13
performance for roughly square or
33:16
cubical type curries and can get poor
33:20
for general curries
33:27
but it is a simple structure well let's
33:33
take a look at the next structure and
33:39
this is range trees and range trees are
33:43
going to be defined recursively we will
33:46
first define it for the case of having a
33:50
single key or one dimensional data and
33:52
then the case for two-dimensional data
33:55
will build upon the solution for one
33:58
dimensional three dimensional we made up
34:00
of many solutions for two dimensions and
34:02
so on so as a number of dimensions
34:04
increases the structure gets more
34:06
complex takes more space and more time
34:08
to search all right for the case of one
34:13
dimension is just a sorted array on that
34:15
one dimension okay so if you have just
34:19
one dimension I just take all the
34:20
records sort them out into increasing
34:22
order of ki and that's my structure the
34:30
time T constructed is n log n just run a
34:33
sort the space required is order ven and
34:37
to do a search you do first a binary
34:41
search on l1 okay so maybe that gets you
34:46
out here then you look at these serially
34:48
until you find the first one that's
34:50
outside the range so takes you log in to
34:54
do that binary search on every one and
34:56
then s units to pick up the records in
35:00
the range one by one
35:04
alright so that's fairly simple from
35:09
that we can build a good solution for
35:12
two for K equals two and the solution
35:21
has a binary tree on X followed by the
35:27
one-dimensional solution on Y stored in
35:29
each node of the binary tree right so
35:33
it's going to be a binary search tree on
35:35
the first coordinate
35:36
X
35:40
in each node we'll pick an x value which
35:42
splits the total number of points into
35:44
roughly two parts like we were doing in
35:46
the KD tree except now we're going to
35:50
pick X for every node if the x value is
35:57
less than equal to what you pick as the
35:59
partitioning and goes in the left
36:00
subtree just like in a KD tree otherwise
36:03
it goes into the right subtree okay so
36:09
we have a way to split the points at
36:10
each node based on their x value each
36:20
node is going to store a sorted array on
36:23
all of the points that are left with it
36:26
so when you start the roots about all
36:28
endpoints I'll have a sorted array and
36:30
all endpoints then you split into two
36:32
parts
36:33
so the left subtree you have a sorted
36:34
array on the N over two points it gets
36:35
and the right subtree will have a sorted
36:37
array on the N over two points it gets
36:44
and so on
36:49
as in the case of a Katy tree it's not
36:53
generally advantageous to keep the
36:55
splitting process going til you're down
36:57
to partitions of size 1 you typically
36:59
stop at a smaller I'm sorry at a larger
37:02
size partition alright so I'm going to
37:12
have a binary search tree each binary
37:15
search tree node is going to contain a
37:18
sorted array of all of the points it's
37:21
supposed to represent the root
37:24
represents all the points in the
37:25
collection so I have a sorted array on
37:28
y-values here all endpoints are stored
37:30
here and then I pick an x value to
37:33
partition the points along the x value
37:36
is the median so you get half of the
37:37
points here and half of the points there
37:43
right so half of those points are going
37:46
to show up here this node will contain
37:52
those half in sorted order on Y and then
37:57
you're going to pick a value to split on
38:00
ok again you're going to split only if
38:03
the number of points here is more than
38:05
the threshold ok so pick a median value
38:08
assuming it's more than threshold pick a
38:10
median value to split these points on
38:12
same thing happens on that side this
38:15
side gets the other half of the points
38:17
that were here and if that number is
38:20
more than the threshold you're going to
38:21
pick a median on X to do the split ok
38:25
alright so in this case I'm assuming
38:27
I've got more than the threshold so I'm
38:29
going to split these ok and now here
38:32
I'll get n over 4 points roughly there
38:35
I'll get n over 4 points and over 4 and
38:37
over for the union of these two sorted
38:40
arrays and Y will give me that one the
38:44
union of these two will give me this the
38:47
union of those two will give me that so
38:51
each node will have two children each
38:54
having roughly half of the points in
38:56
that node
38:59
and this could be split further if this
39:01
was bigger than the threshold you picked
39:03
the median of that Sadie and then split
39:05
them into n over eights and so on
39:09
alright so for the case of K equals two
39:13
you have a binary search tree on X each
39:16
node has the solution for the
39:19
one-dimensional case which is a sorted
39:21
array on Y all right so that's the
39:28
structure for K equals two unlike the KD
39:37
tree where here you might be splitting
39:41
on x value here on Y value there on x
39:43
value here on X and there on Y here the
39:47
binary search tree is always splitting
39:49
on X with each node you can associate an
39:58
X range okay so the X range of this node
40:02
could be say the smallest X here are
40:06
going to the biggest X here so the X
40:11
range of a node is the start of the
40:15
smallest x value of the points in that
40:18
node going up to the biggest x value all
40:25
right so let's say you want to search so
40:29
we start at the top if the X range of
40:34
the root okay it's contained inside the
40:38
curry so suppose that the X range of the
40:42
root is from 1 to 10 okay so all of the
40:46
points have x value between 1 and 10 and
40:48
your query is for values between 0 and
40:52
20 okay so then the X range of the root
40:56
is contained in the range of the query
40:57
and that tells you that all of the
41:01
points here satisfy the X part of the
41:05
query you only have to check the y part
41:07
okay so I then check the y part and
41:10
everybody here that satisfies the y part
41:12
is
41:13
the answer there's no point coming down
41:15
here there are no new points in a
41:18
subtree that aren't contained over here
41:21
okay so if you reach a node whose x
41:25
range is contained in the Curry's X
41:27
range you search its lower level
41:29
structure the sorted array in this case
41:31
and you terminate the search of that
41:33
node
41:33
you don't go further down right if the X
41:43
range of the query is to the left of the
41:49
of this point a ok so all the points
41:54
you're looking for have an x value less
41:56
than equal to a well then they have to
41:58
be on this side there can't be any on
42:00
this side because this sides got only
42:01
people whose x value is bigger than a so
42:04
in that case you search the left subtree
42:05
recursively if the x range of the query
42:11
is wholly on this side then you search
42:13
that subtree recursively if you straddle
42:22
then you have to search both left and
42:25
right subtrees all right so it's similar
42:30
to how we were searching Cayley trees
42:38
the main difference is that you can stop
42:42
right up here when you find that the X
42:45
range of this fellow is contained within
42:47
the Curie range let's see if there any
42:58
questions on how you search let me just
43:00
go back to that just like that okay
43:03
all right any questions about how you
43:05
search this thing the what the node
43:17
structure well each node has an x value
43:22
in it and each node has a lower level
43:25
structure in this case it has a solar
43:28
array the sorted array contains all of
43:32
the points that are represented by this
43:35
node the root represents all points in
43:37
the collection is children represent
43:39
roughly and over two points each the
43:42
grandchildren represent roughly and over
43:44
four points each okay so the x value
43:50
here is such that points in here that
43:56
are have an x value less than equal to a
43:58
show up in the left sub-tree points in
44:01
here that have an x value greater than a
44:03
show up in the right subtree and that
44:05
process repeats at every node okay
44:18
okay all right so let's look at the
44:22
performance of this pre-processing time
44:26
to set up the structure right flame is
44:32
it's n log n this is the true dimension
44:35
case to set up the structure I'm going
44:40
to first sort all of the points by Y
44:42
value
44:42
that'll give me this sorted array so
44:45
that takes n log n time okay then
44:51
having done that I need to find the
44:54
median I can run a median algorithm
44:56
which will take me order of n time I can
44:58
get this value then I need to split this
45:01
array into two parts to split the array
45:04
into two parts and just sweep it left to
45:06
right look at the x value if it's less
45:08
than equal to this icon sit here if it's
45:10
bigger than that I toss it out there so
45:13
these two come from that in linear time
45:17
okay then I need to find the median of
45:23
these guys so that takes n over two time
45:26
using the linear median finding
45:28
algorithm I need to find the median of
45:31
these X's that takes on over two times
45:32
the total of n to find the medians that
45:35
then I go to split these based on that
45:39
so I sweep this left to right if the x
45:40
value is less than equal to B I toss it
45:42
here if the x value is bigger than B I
45:45
toss it there so that takes n over two
45:47
to do the split and over two to do that
45:50
splits a total of n to do the split okay
45:55
all right so if you look at this over
45:59
the whole process there are n minus one
46:02
internal nodes at each one you have to
46:05
find a median but the total amount of
46:10
time you spent finding the medians okay
46:13
on this level we spent n units of time
46:15
to find these two medians on that level
46:18
you spent any units of time to find
46:20
those four mediums and over four four
46:22
okay so any units of time with each
46:25
level to find the medians needed at that
46:28
level
46:30
also it takes order of n time to split
46:33
these sorted arrays to the half soda to
46:36
raise for the next level okay so the
46:39
total time you spend per level is out of
46:41
n the number of levels is log n so your
46:45
total times n log n plus the initial n
46:48
log n to sort on why to create the route
46:51
so did array the space needed is also n
46:59
log n on each level you're storing n
47:06
records so you need order n space and
47:09
the total number of levels is log n so
47:12
you need n log n space to perform the
47:20
query using an argument similar to that
47:28
used for the trees structures we've seen
47:32
so far like segment trees you can show
47:35
that at each level at most two sorted
47:38
arrays can be searched you can't be
47:42
doing more than two and so the search
47:46
time at each level is two times that to
47:49
search a sorted array which was log n
47:52
plus the number of points reported and
47:55
the total number of levels is log n so
47:58
it becomes log square n plus the total
48:00
number of points reported let's take a
48:06
quick look at the case of three rights
48:11
very much like how we went from one to
48:13
two okay so now I've got three fields
48:16
called them W X Y I'm going to have a
48:19
binary search tree on W at the top level
48:23
okay at each node you're going to store
48:27
the median of the W values for the
48:31
points that node represents that's going
48:32
to be used to do the split same way if
48:38
the W value is less than equal to the
48:40
median it goes in the left subtree if
48:42
it's bigger it goes in the right subtree
48:47
now at each node instead of storing a
48:49
sorted array we're going to store the 2d
48:52
structure which means you're going to
48:55
have another binary search tree in each
48:57
node this will be on X and each binary
48:59
fellow will have a sorted array on Y and
49:05
again you stop petitioning when you
49:08
reach some threshold all right so I
49:15
picked the median of my double use of
49:17
all my points let's say that's a I'm
49:18
going to have a 2d structure for all
49:21
endpoints those points with W less than
49:24
eagled a will show up down here
49:26
there'll be roughly half of what I had
49:28
up there and those would double W value
49:31
bigger than a will show up on that side
49:33
and then you repeat that process until
49:37
you reach the threshold
49:48
search works in the same way if the W
49:55
range of this node is inside the query
49:57
we search the 2d structure I don't have
49:59
to search the subtrees if the W arranged
50:06
the query is less than equal to a there
50:09
aren't any points there we search the
50:11
left subtree if the W range in the query
50:14
is bigger than a we searched down here
50:17
and not the left subtree if you straddle
50:20
you search both the left and right
50:22
subtrees okay that's exactly what we
50:26
were doing for the 2d case if you look
50:30
at the performance we can again go
50:33
through the analysis we're going to go
50:34
up by one factor of log n on everything
50:37
okay
50:38
so the analysis is it's all the same the
50:43
pre-processing time goes up by a factor
50:45
of log n the reasoning is the same these
50:49
two lines are the same as we had before
50:50
just change the cost of doing one level
50:56
the space required goes up by a factor
50:58
of log n again the analysis is the same
51:01
as we had for the 2d case I have the
51:04
same lines as I had before previously
51:06
had a and out there now became n log N
51:16
the query time also goes up by a factor
51:21
of log n the reasoning is the same as we
51:25
had for the 2d case okay you'll notice
51:31
these are exactly the same bullets we
51:33
had before except that the time to
51:37
search one node has changed all right
51:44
you can now generalize this to any
51:45
number of dimensions in the same fashion
51:48
and each time you go one dimension
51:50
higher the factor of log goes up by one
51:54
on all three of the metrics alright so
52:06
depending upon the number of dimensions
52:07
you're dealing with your complexity
52:11
would vary each added dimension adds a
52:15
log n factor to the complexity okay any
52:26
questions
52:36
are these used in static databases I
52:41
can't answer that questions I don't
52:43
really know if you know if the static
52:47
databases are using this structure or
52:49
not yeah okay any other questions okay
52:57
so we stop here