Transcript 00:10 okay yeah are there any questions before 00:13 we start today no questions okay all 00:19 right so we're going to look at a new 00:23 problem which I guess in some sense is 00:26 related to problems you looked at before 00:28 this is a multi-dimensional range search 00:31 problem okay so when we were looking at 00:34 priority search trees we were handling 00:36 data two-dimensional data points and 00:39 with XY coordinates now I guess we look 00:43 at higher dimensions but the structures 00:45 will also be applicable to collections 00:48 of points in two dimensions the priority 00:54 search trees even though we were able to 00:57 handle points in two dimensions there 01:00 were requirements there that all of the 01:02 points should have a y value that's 01:03 written equal to zero and then all the 01:08 rectangles kind of made that assumption 01:10 that the bottom that the bottom of the 01:14 query was always y equals zero so you 01:16 really couldn't do a quarry efficiently 01:19 for a rectangle that was sitting 01:21 somewhere in arbitrary two-dimensional 01:24 space with a nonzero y value all right 01:27 so here 01:32 these structures we're going to look at 01:34 today for the static case so you're a 01:36 collection of points in 01:38 multi-dimensional space does not change 01:40 in time you know that collection ahead 01:43 of time and then you build a data 01:44 structure to perform curries in an 01:46 efficient manner all right so no changes 01:50 to the collection of points the only 01:53 operation being performed is one of 01:55 query of course there's the operation of 01:57 construct construct the data structure 01:59 initially the data we're dealing with is 02:08 k dimensional data so you have K code 02:13 and the curries are correspondingly 02:15 k-dimensional curries so for example 02:19 we'd be given case sets of ranges L 02:23 being a lower bound for the range you an 02:25 upper bound okay and then we need to 02:29 report all points or data in our 02:33 collection which fall within the given 02:36 range so in each of the K feels your 02:39 value lies between Li and UI so in the 02:45 case of two dimensions this would 02:47 correspond to being provided a rectangle 02:49 right k dimensional quarries show up 02:57 frequently in database type applications 03:00 so for example we may want to report all 03:04 employees in our collection with the 03:06 property that their age lies between 30 03:08 and 40 okay and the salary lies between 03:11 40,000 and 70,000 okay so you have a 03:14 two-dimensional query l1 is 30 u 1 is 40 03:19 l 2 is 40 L u 2 is 70 you'd go to higher 03:27 dimensional curries so in a geographic 03:30 database and report all the cities in 03:33 your collection that have rainfall 03:35 between 40 and 60 inches so l1 is 40 and 03:38 u 1 is 60 and where the population lies 03:43 between hundred thousand and two hundred 03:44 thousand so l2 is hundred thousand l2 is 03:47 two hundred thousand the average 03:50 temperature in the cities got to be at 03:51 least 70 degrees Fahrenheit so l3 is 70 03:55 and u3 is infinite and the number of 04:00 horses in the cities between thousand 04:02 twenty five and twenty five hundred so 04:04 you get a four dimensional query here 04:07 alright so when you're dealing with 04:09 collections of records whether they be 04:12 employee records or records that 04:15 describe property of cities we need to 04:17 be able to handle queries in multiple 04:20 dimensions and the number of dimensions 04:22 could get fairly large 04:26 well there's several data structures 04:29 that one can think of some of them a 04:33 simple straightforward others more 04:36 involved for solving this problem okay 04:40 so the simplest thing would be maybe say 04:43 no data structure at all you just take 04:46 your collection of records store them in 04:48 some arbitrary unordered fashion you get 04:50 a curry you just examine every single 04:52 record see which one satisfy the query 04:55 and report those alright next in 05:01 simplicity may be something called 05:03 sorted tables so you sort your 05:09 collection of Records on each of the K 05:12 keys of each of the K fields and you 05:18 build one table whereas table is just a 05:20 sorted collection of records for each of 05:22 those K Keys so you end up with K of 05:25 these tables and then when you get a 05:29 curry you examine each of these tables 05:34 so you look at table one and find out 05:37 how many records Y satisfy the l1 you 05:42 one range look at table to find out how 05:46 many satisfy the o2u to range look at 05:49 table 3 find out how many live in the l3 05:51 you three arrange and so on then you 05:54 take the smallest of these and examine 05:58 from that table more carefully to see if 06:01 they satisfy the remaining requirements 06:07 so if you have two dimensions okay then 06:12 maybe you first look at Table one and 06:14 find that there are 100 records that lie 06:17 between l1 and u1 look at Table two you 06:20 may find that there are only 20 records 06:23 that lie between l2 and YouTube so I'll 06:27 use the table to results take a look at 06:29 those 20 records and verify which ones 06:32 lie between l1 and u1 and report those 06:34 as the result 06:37 but then there's a structure called 06:39 cells I'm going to describe that here 06:41 you can take a look at that from the 06:43 reference something called KD trees 06:47 we're going to look at that structure in 06:49 detail today and something called range 06:54 trees we'll take a look at that 06:56 structure also in detail today 06:59 the other more involved structures Capel 07:02 trees we're not going to have time to go 07:04 through those in this class and then 07:09 there are K ranges we won't have time to 07:11 look at those either so we're going to 07:16 focus just on these two here today all 07:21 right so let's start by taking a look at 07:22 KD trees maybe before we look at that 07:28 let's put up the metrics we're going to 07:29 use to evaluate the structures so we're 07:33 going to be concerned with the amount of 07:35 time it takes to build the data 07:38 structure so that's the pre-processing 07:40 time P time required to build the data 07:43 structure to pre-process the data when 07:46 you have n records and K keys okay is 07:50 the number of dimensions for most 07:57 applications this is not a very 07:58 sensitive quantity you pre-process the 08:02 data once and then you search it many 08:05 many times and the only requirement 08:07 really is that the pre-processing time 08:10 be reasonable what does reasonable mean 08:13 in some contexts reasonable might mean 08:16 that an hour is enough in other contexts 08:19 a month may be enough just depends on on 08:25 how much you hope to gain by putting in 08:27 the extra pre-processing time so we may 08:33 not be terribly concerned with the 08:35 amount of time spent pre-processing 08:38 except when it becomes intuitively very 08:42 unreasonable and if it takes a hundred 08:43 years it's probably not a reasonable 08:46 amount of time 08:47 all right we be concerned with the 08:50 amount of space the structure takes so 08:53 the structure that's built as a result 08:55 of this pre-processing how much space 08:57 does it take and again the concern about 09:03 space would be very dependent upon the 09:08 application you have if you're going to 09:10 take the structure and store it on a on 09:14 a DVD and then sell these DVDs there's 09:19 no advantage to using say less than two 09:23 gigabytes of space if that's the 09:25 capacity of your DVD so getting the 09:28 space down to below the capacity of the 09:30 physical media may have zero value so 09:37 again depending upon how you intend to 09:39 use the structure exactly what value or 09:43 importance you want to place on space 09:45 would vary and then there's the Curie 09:50 time how much time does it take to 09:53 answer a query and that's really the 09:56 going to be the most important criteria 09:58 here because you're going to be 10:00 answering queries many many times you're 10:02 going to build it once space or long as 10:05 you can fit it in the media on which you 10:07 plan to run the queries that's probably 10:10 adequate all right so let's take a look 10:16 at the first at the first of the two 10:18 structures we want to study today this 10:20 one is the KD tree the KD tree is a 10:25 binary tree and basically at each node 10:33 of this binary tree you partition your 10:36 set of points into two roughly equal 10:39 groups for this petitioning you pick a 10:44 coordinate on which to cut the points 10:46 and once you pick a coordinate okay so 10:52 first you pick a coordinate and there 10:55 are different strategies to pick 10:56 coordinates but generally one uses this 10:59 heuristic here which says 11:01 pick a coordinate in which the values of 11:04 your points is has the maximum spread so 11:09 if you're looking ATS they're all of the 11:10 first coordinates you'll find the 11:12 smallest first coordinate the largest 11:14 was going to take the difference 11:16 there'll be the spread in that 11:17 coordinate so so pick one where you got 11:20 a good large spread once you've picked 11:24 the coordinate then for the actual point 11:26 on which to do the splitting of this set 11:29 of points you take the median value in 11:34 that coordinate I will see an example to 11:39 make that more clear okay so the node of 11:43 the binary tree will actually store the 11:46 coordinate on which you do the cut as 11:47 well as the value of the coordinate that 11:49 you use then all of the K dimensional 11:56 points in your collection which on that 11:59 particular coordinate have a value less 12:02 than equal to M will go into the left 12:04 subtree and those with the value bigger 12:07 than m will go in the right subtree so 12:11 the expectation is since you're picking 12:13 the median but half the points will show 12:14 up on this side and half will show up on 12:16 that side okay you end up with a slight 12:20 problem when there are a large number of 12:21 points that are equal to the median then 12:23 you don't get it good half split 12:34 you stop this petitioning process when 12:37 the number of K dimensional points in a 12:41 partition becomes small enough so 12:44 typically you don't go all the way down 12:45 to one point you have some threshold the 12:48 threshold I guess what I've got on the 12:51 slide is say 8 or 16 points again the 13:02 the the threshold in a real situation 13:07 may be determined by physical 13:09 characteristics of the media where 13:11 you're storing this thing the structure 13:13 so if you're storing it on a hard disk 13:16 then the retrieval unit from your disk 13:19 would be a block and so once you get a 13:22 partition down to a size of a block you 13:25 just leave it at that bring in the whole 13:26 block and search it using some in some 13:30 alternative structure okay all right so 13:34 we'll stop when the partition size 13:36 becomes small let's take a look at an 13:41 example and the example would be a two 13:45 dimensional case so I've got a 13:48 collection of points which are thrown 13:50 out in two-dimensional space or to 13:53 organize these into a 2d tree so the 2d 13:58 tree is going to be a binary tree in 14:01 fact doesn't matter whether you have two 14:03 dimensional or twenty dimensional points 14:04 it's always a binary tree okay for the 14:09 route I need to pick a dimension along 14:11 which to partition these points into two 14:14 groups okay depict the dimension I look 14:18 at the spread in X direction as well as 14:25 the spread in the y direction so the 14:27 spread in the X direction this point has 14:29 the lowest x value in this fellow I 14:32 guess up there has the highest x value 14:34 so the spread is you take the difference 14:36 in the two X values look at the spread 14:38 in the y direction so this one's got the 14:41 lowest Y and somebody up there's got the 14:43 highest Y look at that so here the 14:46 spread is more in the X direction so 14:48 pick the I choose to cut in the 14:51 x-direction alright then I need to 14:54 select a value of X to cut so it picked 14:57 the median of the X values of this 14:59 collection of points okay and that turns 15:03 out to be this x value here so points 15:09 that are less that have an x value less 15:11 than equal to this cut that's these 15:13 including this one on the line and the 15:16 cut line will show up on the left 15:17 sub-tree and the remaining points those 15:20 to the right of the cut line will show 15:22 up in the right subtree 15:25 alright so the route will contain 15:30 information about the coordinate that 15:32 you picked and the value of that 15:34 coordinate so even though my label here 15:37 just says a is the cut line in reality 15:40 what you distort here is the cutters on 15:42 X and the value of x is whatever 15:44 corresponds to this point okay all right 15:48 so in the left sub-tree I'm going to 15:50 have these three four five six seven 15:52 eight nine points and then the remaining 15:54 two four six eight nine points in the 15:57 right subtree alright so now for the 16:03 left subtree I've got these nine points 16:06 I've got to divide those into two groups 16:13 for this example I will assume that the 16:16 threshold at which I stop dividing is 16:18 three okay so since the number of points 16:22 is more than my threshold I got it 16:24 divided to divide I got to pick a 16:26 coordinate I look at the X spread so 16:29 expert is from here to over there 16:31 the Y spread is from here to over that 16:35 okay so the X spread is bigger I'm going 16:37 to divide on X and I pick a median X 16:41 okay so I end up picking this value of X 16:45 with this these one two three four five 16:48 points will be in the left subtree and 16:50 then these one two three four will be in 16:53 the right subtree 16:59 okay all right for the other side the 17:07 number of points exceeds my threshold of 17:09 three so again I got to pick a 17:13 coordinate to cut on I look at the 17:16 expert and the widespread the expert is 17:18 more than the why so then I got to pick 17:20 a median x-value to do the cut and it 17:24 turns out to be that one over there see 17:26 and so that'll give me these one two 17:28 three four five points in the left 17:30 subtree of c and these four and the 17:33 right subtree of C all right so the left 17:45 subtree of B has these five points and 17:48 since that's bigger than the threshold 17:51 I've got to split the left subtree of B 17:54 to split the left subtree I look at the 17:56 X pride so that's from here to there and 17:58 the widespread is from here to here so I 18:01 determine which one is bigger and then 18:04 in this case it's the x1 again and then 18:07 I pick a median X which turns out to be 18:10 D these three points end up in the left 18:12 subtree and these two end up in the 18:15 right subtree okay all right so I do 18:18 another cut of the left subtree of B 18:21 that's down the long D the right subtree 18:25 of B has one two three four points so 18:28 you got to do a cut the X spread is now 18:31 from here to there the wide spread is 18:33 from here to there 18:34 the wide spread is bigger so I do a cut 18:37 on why I picked the median Y which turns 18:40 out to be that fellow so these two 18:42 points are in the left subtree and those 18:45 two in the right subtree okay so you cut 18:48 on D all right so we cut at E now if you 18:56 go back a moment okay so the left 18:58 subtree of d as these three points it's 19:00 not going to be partitioned any more 19:02 because the threshold were using is 19:03 three the right subtree has these two 19:06 points that will not get partitioned 19:07 same thing here for 19:10 the left subtree has two points and the 19:13 right subtree has two points so those 19:16 will not be partitioned further and so 19:19 going into c okay the left subtree of c 19:23 is got one two three four five points we 19:26 have your partition I look at the x 19:28 spread 19:29 that's from I guess here to over there 19:32 and the wide spread is from here to 19:34 there 19:34 okay so the Y spread is more I picked 19:38 the median Y value which is f so these 19:43 three points in the left subtree and 19:45 those two in the right subtree going to 19:52 this side right subtree of see i've got 19:54 these four points smaller than the 19:56 threshold so you're going to split the x 19:59 spread is from here to there and the y 20:02 spread is also determined by those two 20:04 points okay so in this case the x spread 20:09 this distance is more than that distance 20:11 I pick an x value the median X is G so I 20:15 get these two to be in the left subtree 20:17 of C and the other two in the right 20:19 subtree of C okay so in this case none 20:22 of those four nodes that we have are 20:25 going to get split any further instead 20:28 they'll just point to buckets which 20:31 contain the respective points okay so 20:34 the left subtree of d has these three 20:37 points it's a bucket which has those 20:38 three points the right subtree of d is a 20:41 bucket that has these two points 20:58 all right so because you used the median 21:04 value to do the splits we expect that 21:07 the number of levels in this tree will 21:09 be log of n at each level the size of a 21:13 partition becomes roughly half all right 21:21 so in a KD tree you have a binary tree 21:25 which is used to petition all of your 21:27 points into a collection of buckets the 21:33 points themselves are stored in a bucket 21:35 each point is stored in exactly one 21:37 bucket let's take a look at how you 21:47 might perform a search so to perform a 21:50 search you are provided a curried 21:53 rectangle in the two-dimensional case so 21:56 I'm given a l1 value a u one value and 22:00 l2 value and a u2 value unlike priority 22:05 search trees where the l2 value has to 22:08 be 0 here the l2 value could be anything 22:14 right so with this query rectangle I'm 22:21 required to report all the points in 22:23 there and that's going to be this blue 22:25 point and that blue point here okay all 22:29 right this yellow line here is just to 22:31 the right of my country ok so to find 22:37 all the points that line the rectangle I 22:39 really need to find out which of these 22:42 partitions that I have overlap with this 22:45 rectangle and then search those 22:47 petitions each of these partitions 22:49 corresponds to a blue bucket okay so 22:53 this partition here doesn't overlap with 22:57 this ok again it's not evident from this 22:59 figure but this yellow line terminates 23:01 just a bit to the right of this cut okay 23:04 so this petition has no overlap so I 23:08 don't really need to search that bucket 23:11 this tall petition overlaps I go to 23:13 search that bucket this partition here 23:15 overlaps I go to search this bucket and 23:17 none of the others to perform the search 23:24 you start at the root okay 23:29 and you determine whether your Curie 23:32 rectangle in this case is entirely to 23:35 the left of this or entirely to the 23:41 right again when I say to the left would 23:42 mean if it's less than equal to it's 23:44 totally to the left otherwise is totally 23:47 to the right all right so so in this 23:55 particular case my query rectangle is 23:58 totally to the left of this so that 24:00 tells me there are no points in the 24:02 right subtree of a so I can cut that off 24:04 I don't need to search that then I move 24:07 here and I look at B and I see that 24:11 there is an overlap you could find 24:13 points in the left subtree of B as well 24:15 as the right subtree of B so I go to 24:17 search both the left and right subtrees 24:18 okay then when you come at D we see that 24:24 there's no possibility of points on the 24:25 left subtree of d because there's no 24:27 overlap so we don't need to go into that 24:29 subtree and if you're looking out here 24:34 in to be I not have to compare whether 24:37 you're above or below this cut line e 24:40 okay so from that node I know that this 24:44 is a horizontal cut line okay so I check 24:48 whether my rectangle is above or below 24:51 or if it straddles if it straddles you 24:53 got to do both sub trees okay so here 24:56 I'm below so I only have to do the left 24:58 subtree so a chop off the right all 25:04 right so we use the tree structure to 25:07 identify the buckets that need to be 25:09 examined closely and then in this case 25:13 I'll end up with these two buckets I 25:15 look at those two buckets and report the 25:17 points in there which actually lie 25:20 within the rectangle okay they may be 25:22 points in these 25:24 two buckets that do not line the 25:25 rectangle and in this particular case 25:28 yeah there are so this fellow here okay 25:32 and that fellow are not in the rectangle 25:36 even though they are in this particular 25:38 bucket all right so that's how the 25:42 structure works 25:57 or any questions about the structure 25:59 before we analyze its complexity yeah 26:16 the the threshold usually is not a 26:18 function of the initial number of points 26:20 it's really based on on looking at a 26:27 payoff between going down the tree 26:30 versus just searching a bucket directly 26:32 okay so the you have a cost associated 26:36 with moving down the tree because you 26:38 need to check whether you have a 26:39 rectangle overlap okay and depending 26:45 upon where this structure is sitting you 26:48 have a cost associated with accessing 26:49 this node from that media okay so at 26:53 some point it's cheaper to just take a 26:55 look at say 10 points or 20 points and 26:58 go through them than it is to go through 27:00 a binary search tree and then eventually 27:03 look at two points site so that's 27:07 independent of how many points you start 27:08 with it's more a characteristic of the 27:11 media you're dealing with okay all right 27:17 other questions 27:21 all right so let's look at the 27:23 performance all right first you create 27:27 the structure so to create the structure 27:30 takes n log n time and let's see why 27:34 that is so okay 27:36 all right since we're petitioning by 27:38 medians we expect to have log n levels n 27:41 is the number of records in your 27:44 collection at each level okay 27:50 so for example here I need to pick a 27:54 dimension to cut on so for that I look 27:57 at the spread so you got to find the 27:59 smallest X and largest X take the 28:02 difference smallest Y largest why take 28:04 the difference okay so you can do all 28:06 that in order n time well it's really K 28:15 times n okay but if you assume that K is 28:19 not part of the complexity measure then 28:21 it's just n okay 28:23 so once you fix the number of dimensions 28:25 we just the exam okay 28:27 so strictly as K times n okay but if you 28:32 zoom K is not a parameter here then over 28:34 here all right so you can find the 28:39 dimension to cut in order n time once 28:42 you've found the dimension you got to 28:44 find the median you can find a median of 28:46 n items in order n time okay so in order 28:49 n time you can determine this value okay 28:53 now when you come down here there are n 28:56 over 2 here so it takes you order and 28:58 over to to find the stuff here 29:00 similarly order in over to there so 29:02 total is again n same thing here n over 29:07 4 n over 4 and over 4 and over 4 total 29:09 of n so at each level you spend order n 29:13 time to find the dimension to cut on as 29:16 well as the medians and then to actually 29:20 do the partitioning of the points is 29:21 linear in the number of points you're 29:23 petitioning so you've got n over 2 you 29:25 can create the n over 4 for here and the 29:27 arrow for here in n over 2 time okay so 29:31 it takes you n time for a level to do 29:33 the petition 29:36 all right so total of log n levels at 29:39 each level you spend order n time to do 29:42 the work the works out a total of n log 29:43 n so there are other ways to arrive at n 29:47 log n you could say over there you could 29:53 just sort by if you're looking at say 2d 30:00 case sort by X sort by Y have two sorted 30:03 lists okay and then to find the spreader 30:06 just look at the endpoints of the two 30:08 lists compute the difference so now 30:10 instead of K and it's okay okay 30:13 the median is simple you just pick the 30:15 middle element from whichever list you 30:18 decide to cut on and then you split 30:21 those lists by marching down them left 30:23 to right you're retaining the sorted 30:25 order so again you saw it only once you 30:29 spend the N log N or if it's K 30:32 dimensions K n log n time to do the sort 30:35 on each dimension once and then you the 30:39 splitting is linear time as you go down 30:41 so it's still order n log yeah alright 30:46 so pre-processing takes n log n the 30:52 amount of space that you need is order 30:55 then the total number of buckets is 30:58 order of n if you view these as external 31:04 nodes to the tree the number of internal 31:06 nodes is one less than the number of 31:07 external nodes so the number of internal 31:10 nodes is n minus one at most so total 31:14 number of nodes in the system is n see a 31:15 total space is order thin 31:22 okay again this kind of a crude count 31:25 for your records are bigger than these 31:27 so maybe you want to say you need space 31:30 for n records plus n minus 1 internal 31:33 notes 31:42 alright the performance analysis is 31:44 somewhat difficult to perform and we're 31:48 not going to do that here I'm just going 31:49 to give the results the the worst case 31:56 query time looks something like this K 31:59 is a number of dimensions so n raise to 32:01 1 minus 1 over K plus the number of 32:03 points in the answer ok if you look at 32:07 this term here as K becomes large okay 32:14 so at K equals 2 for example this is 32:17 square root of 2 but at K equals Forte's 32:19 end to the 0.75 escape becomes large 32:22 this becomes close to n so that's like 32:27 you're looking at all the records okay 32:32 so worst case performance isn't good but 32:37 when you're curries are cubical which 32:40 means you're in the case of 32:42 two-dimensional looking for square type 32:44 boxes all the points within a square 32:47 roughly then you get good behavior on 32:53 average you look like log n plus s when 32:58 the number of points in the answer is 33:00 small and when the number of points in 33:03 the answer is large the log n kind of 33:05 gets to offed by the S so you just end 33:07 up with all of yes all right so the 33:11 structure is expected to have good 33:13 performance for roughly square or 33:16 cubical type curries and can get poor 33:20 for general curries 33:27 but it is a simple structure well let's 33:33 take a look at the next structure and 33:39 this is range trees and range trees are 33:43 going to be defined recursively we will 33:46 first define it for the case of having a 33:50 single key or one dimensional data and 33:52 then the case for two-dimensional data 33:55 will build upon the solution for one 33:58 dimensional three dimensional we made up 34:00 of many solutions for two dimensions and 34:02 so on so as a number of dimensions 34:04 increases the structure gets more 34:06 complex takes more space and more time 34:08 to search all right for the case of one 34:13 dimension is just a sorted array on that 34:15 one dimension okay so if you have just 34:19 one dimension I just take all the 34:20 records sort them out into increasing 34:22 order of ki and that's my structure the 34:30 time T constructed is n log n just run a 34:33 sort the space required is order ven and 34:37 to do a search you do first a binary 34:41 search on l1 okay so maybe that gets you 34:46 out here then you look at these serially 34:48 until you find the first one that's 34:50 outside the range so takes you log in to 34:54 do that binary search on every one and 34:56 then s units to pick up the records in 35:00 the range one by one 35:04 alright so that's fairly simple from 35:09 that we can build a good solution for 35:12 two for K equals two and the solution 35:21 has a binary tree on X followed by the 35:27 one-dimensional solution on Y stored in 35:29 each node of the binary tree right so 35:33 it's going to be a binary search tree on 35:35 the first coordinate 35:36 X 35:40 in each node we'll pick an x value which 35:42 splits the total number of points into 35:44 roughly two parts like we were doing in 35:46 the KD tree except now we're going to 35:50 pick X for every node if the x value is 35:57 less than equal to what you pick as the 35:59 partitioning and goes in the left 36:00 subtree just like in a KD tree otherwise 36:03 it goes into the right subtree okay so 36:09 we have a way to split the points at 36:10 each node based on their x value each 36:20 node is going to store a sorted array on 36:23 all of the points that are left with it 36:26 so when you start the roots about all 36:28 endpoints I'll have a sorted array and 36:30 all endpoints then you split into two 36:32 parts 36:33 so the left subtree you have a sorted 36:34 array on the N over two points it gets 36:35 and the right subtree will have a sorted 36:37 array on the N over two points it gets 36:44 and so on 36:49 as in the case of a Katy tree it's not 36:53 generally advantageous to keep the 36:55 splitting process going til you're down 36:57 to partitions of size 1 you typically 36:59 stop at a smaller I'm sorry at a larger 37:02 size partition alright so I'm going to 37:12 have a binary search tree each binary 37:15 search tree node is going to contain a 37:18 sorted array of all of the points it's 37:21 supposed to represent the root 37:24 represents all the points in the 37:25 collection so I have a sorted array on 37:28 y-values here all endpoints are stored 37:30 here and then I pick an x value to 37:33 partition the points along the x value 37:36 is the median so you get half of the 37:37 points here and half of the points there 37:43 right so half of those points are going 37:46 to show up here this node will contain 37:52 those half in sorted order on Y and then 37:57 you're going to pick a value to split on 38:00 ok again you're going to split only if 38:03 the number of points here is more than 38:05 the threshold ok so pick a median value 38:08 assuming it's more than threshold pick a 38:10 median value to split these points on 38:12 same thing happens on that side this 38:15 side gets the other half of the points 38:17 that were here and if that number is 38:20 more than the threshold you're going to 38:21 pick a median on X to do the split ok 38:25 alright so in this case I'm assuming 38:27 I've got more than the threshold so I'm 38:29 going to split these ok and now here 38:32 I'll get n over 4 points roughly there 38:35 I'll get n over 4 points and over 4 and 38:37 over for the union of these two sorted 38:40 arrays and Y will give me that one the 38:44 union of these two will give me this the 38:47 union of those two will give me that so 38:51 each node will have two children each 38:54 having roughly half of the points in 38:56 that node 38:59 and this could be split further if this 39:01 was bigger than the threshold you picked 39:03 the median of that Sadie and then split 39:05 them into n over eights and so on 39:09 alright so for the case of K equals two 39:13 you have a binary search tree on X each 39:16 node has the solution for the 39:19 one-dimensional case which is a sorted 39:21 array on Y all right so that's the 39:28 structure for K equals two unlike the KD 39:37 tree where here you might be splitting 39:41 on x value here on Y value there on x 39:43 value here on X and there on Y here the 39:47 binary search tree is always splitting 39:49 on X with each node you can associate an 39:58 X range okay so the X range of this node 40:02 could be say the smallest X here are 40:06 going to the biggest X here so the X 40:11 range of a node is the start of the 40:15 smallest x value of the points in that 40:18 node going up to the biggest x value all 40:25 right so let's say you want to search so 40:29 we start at the top if the X range of 40:34 the root okay it's contained inside the 40:38 curry so suppose that the X range of the 40:42 root is from 1 to 10 okay so all of the 40:46 points have x value between 1 and 10 and 40:48 your query is for values between 0 and 40:52 20 okay so then the X range of the root 40:56 is contained in the range of the query 40:57 and that tells you that all of the 41:01 points here satisfy the X part of the 41:05 query you only have to check the y part 41:07 okay so I then check the y part and 41:10 everybody here that satisfies the y part 41:12 is 41:13 the answer there's no point coming down 41:15 here there are no new points in a 41:18 subtree that aren't contained over here 41:21 okay so if you reach a node whose x 41:25 range is contained in the Curry's X 41:27 range you search its lower level 41:29 structure the sorted array in this case 41:31 and you terminate the search of that 41:33 node 41:33 you don't go further down right if the X 41:43 range of the query is to the left of the 41:49 of this point a ok so all the points 41:54 you're looking for have an x value less 41:56 than equal to a well then they have to 41:58 be on this side there can't be any on 42:00 this side because this sides got only 42:01 people whose x value is bigger than a so 42:04 in that case you search the left subtree 42:05 recursively if the x range of the query 42:11 is wholly on this side then you search 42:13 that subtree recursively if you straddle 42:22 then you have to search both left and 42:25 right subtrees all right so it's similar 42:30 to how we were searching Cayley trees 42:38 the main difference is that you can stop 42:42 right up here when you find that the X 42:45 range of this fellow is contained within 42:47 the Curie range let's see if there any 42:58 questions on how you search let me just 43:00 go back to that just like that okay 43:03 all right any questions about how you 43:05 search this thing the what the node 43:17 structure well each node has an x value 43:22 in it and each node has a lower level 43:25 structure in this case it has a solar 43:28 array the sorted array contains all of 43:32 the points that are represented by this 43:35 node the root represents all points in 43:37 the collection is children represent 43:39 roughly and over two points each the 43:42 grandchildren represent roughly and over 43:44 four points each okay so the x value 43:50 here is such that points in here that 43:56 are have an x value less than equal to a 43:58 show up in the left sub-tree points in 44:01 here that have an x value greater than a 44:03 show up in the right subtree and that 44:05 process repeats at every node okay 44:18 okay all right so let's look at the 44:22 performance of this pre-processing time 44:26 to set up the structure right flame is 44:32 it's n log n this is the true dimension 44:35 case to set up the structure I'm going 44:40 to first sort all of the points by Y 44:42 value 44:42 that'll give me this sorted array so 44:45 that takes n log n time okay then 44:51 having done that I need to find the 44:54 median I can run a median algorithm 44:56 which will take me order of n time I can 44:58 get this value then I need to split this 45:01 array into two parts to split the array 45:04 into two parts and just sweep it left to 45:06 right look at the x value if it's less 45:08 than equal to this icon sit here if it's 45:10 bigger than that I toss it out there so 45:13 these two come from that in linear time 45:17 okay then I need to find the median of 45:23 these guys so that takes n over two time 45:26 using the linear median finding 45:28 algorithm I need to find the median of 45:31 these X's that takes on over two times 45:32 the total of n to find the medians that 45:35 then I go to split these based on that 45:39 so I sweep this left to right if the x 45:40 value is less than equal to B I toss it 45:42 here if the x value is bigger than B I 45:45 toss it there so that takes n over two 45:47 to do the split and over two to do that 45:50 splits a total of n to do the split okay 45:55 all right so if you look at this over 45:59 the whole process there are n minus one 46:02 internal nodes at each one you have to 46:05 find a median but the total amount of 46:10 time you spent finding the medians okay 46:13 on this level we spent n units of time 46:15 to find these two medians on that level 46:18 you spent any units of time to find 46:20 those four mediums and over four four 46:22 okay so any units of time with each 46:25 level to find the medians needed at that 46:28 level 46:30 also it takes order of n time to split 46:33 these sorted arrays to the half soda to 46:36 raise for the next level okay so the 46:39 total time you spend per level is out of 46:41 n the number of levels is log n so your 46:45 total times n log n plus the initial n 46:48 log n to sort on why to create the route 46:51 so did array the space needed is also n 46:59 log n on each level you're storing n 47:06 records so you need order n space and 47:09 the total number of levels is log n so 47:12 you need n log n space to perform the 47:20 query using an argument similar to that 47:28 used for the trees structures we've seen 47:32 so far like segment trees you can show 47:35 that at each level at most two sorted 47:38 arrays can be searched you can't be 47:42 doing more than two and so the search 47:46 time at each level is two times that to 47:49 search a sorted array which was log n 47:52 plus the number of points reported and 47:55 the total number of levels is log n so 47:58 it becomes log square n plus the total 48:00 number of points reported let's take a 48:06 quick look at the case of three rights 48:11 very much like how we went from one to 48:13 two okay so now I've got three fields 48:16 called them W X Y I'm going to have a 48:19 binary search tree on W at the top level 48:23 okay at each node you're going to store 48:27 the median of the W values for the 48:31 points that node represents that's going 48:32 to be used to do the split same way if 48:38 the W value is less than equal to the 48:40 median it goes in the left subtree if 48:42 it's bigger it goes in the right subtree 48:47 now at each node instead of storing a 48:49 sorted array we're going to store the 2d 48:52 structure which means you're going to 48:55 have another binary search tree in each 48:57 node this will be on X and each binary 48:59 fellow will have a sorted array on Y and 49:05 again you stop petitioning when you 49:08 reach some threshold all right so I 49:15 picked the median of my double use of 49:17 all my points let's say that's a I'm 49:18 going to have a 2d structure for all 49:21 endpoints those points with W less than 49:24 eagled a will show up down here 49:26 there'll be roughly half of what I had 49:28 up there and those would double W value 49:31 bigger than a will show up on that side 49:33 and then you repeat that process until 49:37 you reach the threshold 49:48 search works in the same way if the W 49:55 range of this node is inside the query 49:57 we search the 2d structure I don't have 49:59 to search the subtrees if the W arranged 50:06 the query is less than equal to a there 50:09 aren't any points there we search the 50:11 left subtree if the W range in the query 50:14 is bigger than a we searched down here 50:17 and not the left subtree if you straddle 50:20 you search both the left and right 50:22 subtrees okay that's exactly what we 50:26 were doing for the 2d case if you look 50:30 at the performance we can again go 50:33 through the analysis we're going to go 50:34 up by one factor of log n on everything 50:37 okay 50:38 so the analysis is it's all the same the 50:43 pre-processing time goes up by a factor 50:45 of log n the reasoning is the same these 50:49 two lines are the same as we had before 50:50 just change the cost of doing one level 50:56 the space required goes up by a factor 50:58 of log n again the analysis is the same 51:01 as we had for the 2d case I have the 51:04 same lines as I had before previously 51:06 had a and out there now became n log N 51:16 the query time also goes up by a factor 51:21 of log n the reasoning is the same as we 51:25 had for the 2d case okay you'll notice 51:31 these are exactly the same bullets we 51:33 had before except that the time to 51:37 search one node has changed all right 51:44 you can now generalize this to any 51:45 number of dimensions in the same fashion 51:48 and each time you go one dimension 51:50 higher the factor of log goes up by one 51:54 on all three of the metrics alright so 52:06 depending upon the number of dimensions 52:07 you're dealing with your complexity 52:11 would vary each added dimension adds a 52:15 log n factor to the complexity okay any 52:26 questions 52:36 are these used in static databases I 52:41 can't answer that questions I don't 52:43 really know if you know if the static 52:47 databases are using this structure or 52:49 not yeah okay any other questions okay 52:57 so we stop here