Transcript

We also have an index; the index is a dictionary. It contains all of the keys that are in our file, and associated with each key we have an address, which tells us where we can find the corresponding record on the disk.

So a typical mode of operating this simple database is: somebody gives you a key and you want to perform a search. You would first search the index; the index tells you where the record is; you then go and get that record from the appropriate place on the disk.

The operations we want to perform: there is the search operation, also called retrieve (given a key, find the record with the corresponding key), and we also want to be able to update our records, where by update we could mean inserting a new record into the collection, making a change to a record, or removing an existing record.

So basically we have a file of records, and the file is augmented with an index structure, which is supposed to improve the efficiency of the operations you want to perform on this file. A simplistic way of operating this system: somebody wants to perform a search operation; you take the key, you search the index, the index tells you where the record is, and then you get the record from the file, the collection of records. So in a search operation you would first search the index and then get the record. The number of disk accesses you would perform is the number needed to search the index, assuming the index is stored on disk; we assume the index itself is very large and is sitting on a disk. So you spend some number of disk accesses to search the index, and then you do one additional access to get the record from the proper place on the disk. Similarly, if you're doing a delete operation, you would first search the index to find out where the record is, then remove the key from the index and remove the record from the file; you need to reclaim the space occupied by the record so that you can later reuse that space. If you want to do an insert, you insert the key into the index and you insert the record into the file. So that's a fairly simple way to operate the system, and it seems like a fairly intuitive way to do it too.

Now, notice that as you're doing updates, the index, this dictionary, changes. If you do an insert, you add a key and an associated address. If you do a delete, you remove a key. If you're just doing a change, you may need to change the address associated with a key, because you may have to move the record from its current place on the disk to some other place: if you're increasing the size of the record, you may have to go to a larger contiguous chunk of space on your disk. So the index changes over time.
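Before moving on, here is a minimal sketch of this simple mode of operation in Python. The dicts simulating the disk and the index are illustration-only assumptions; in the lecture's setting both structures live on disk and every access is expensive.

```python
# Minimal sketch of the simple indexed file. A dict simulates the
# disk (address -> record); in reality every access below costs
# one or more disk reads/writes.

class SimpleDB:
    def __init__(self):
        self.index = {}      # key -> disk address (the dictionary above)
        self.disk = {}       # simulated disk
        self.next_addr = 0

    def search(self, key):
        addr = self.index[key]      # 1. search the index
        return self.disk[addr]      # 2. one more access for the record

    def insert(self, key, record):
        self.disk[self.next_addr] = record   # write the record to disk
        self.index[key] = self.next_addr     # insert the key into the index
        self.next_addr += 1

    def delete(self, key):
        addr = self.index.pop(key)  # remove the key from the index
        del self.disk[addr]         # reclaim the record's space for reuse

    def update(self, key, record):
        # the new version may need a larger contiguous chunk, so the
        # record can move and the index entry must change with it
        self.delete(key)
        self.insert(key, record)
```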
Couple that first observation, that the index changes over time, with a second one, which says that sooner or later this thing is going to crash. It's just a fact of life: at some point in time you're going to have a failure, and in order to recover from failures, you keep backups. So we have to keep backups so that we can recover from a failure. To recover, you make a copy of the backup that you have of the file; we call that backup the master file. So you go to the backup sitting somewhere and you make a copy of it. You also make a copy of the backup index, the master index, sitting somewhere. Of course, once you've made these copies you haven't recovered your state, because the crash probably didn't happen right after the most recent backup of the index and the file.

So once you've made a copy of the master file and a copy of the master index, you haven't restored the state to what it was at the time of the crash, because that backup probably wasn't made just before the crash; it might have been made an hour earlier, a day earlier, a couple of days earlier, whenever you made the last backup. So the backup of the file is out of sync with the file at the time of the crash, and the backup index is out of sync with the index at the time of the crash; after copying, you're still out of sync. In order to get things back to the state they had at the time of the crash, you need to once again process all of the transactions, actually all of the updates, that took place between the time you made the last backup and the time of the crash. Hopefully you've got all of these stored somewhere, so that you can now process all of these updates against the old index and the old file to recover the state that the index and the file had at the time of the crash; we're going to assume you've got all of them stored.

The time required to recover from a crash depends on how big the master file is, because the time to copy it is a function of how big it is, and it also depends on how big the master index is, because the time to copy that is a function of how large it is. But more importantly, the size of these two determines how much time it takes to recover in terms of processing the transactions, and it does so in two different ways. One is direct: if you're doing inserts and deletes into an index, the cost is a function of how large the index is, so in that sense the time to process the transactions is a function of the index size. The other is indirect: the time to process the transactions is a function of the size of the index and the file because, if the index and the file are very large, you can't be backing them up too frequently. If these are so large that it takes you hours to do a backup, then it's unlikely that you're backing them up every five minutes; more likely you're backing up once a day. If it takes you weeks to back up, chances are your last backup is a few weeks old, which means the number of transactions that have accumulated tends to be larger. The accumulation is larger when these files are big than when they are small, because you simply can't back up more frequently, the larger the item you're trying to back up. So the recovery time becomes a function of how large your index and your file are, for two reasons: one, the time to do the copying; two, the time to do the processing, since the number of transactions tends to be larger the larger these files are, because you can't back up large files too frequently.
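A sketch of that recovery procedure, assuming (purely hypothetically) that the updates since the last backup were logged as (op, key, record) triples:

```python
# Sketch of crash recovery: copy the most recent backup, then replay
# every update made between that backup and the crash. The replay
# cost grows with the number of accumulated transactions.

def recover(backup_records, update_log):
    records = dict(backup_records)       # copy the master file/index
    for op, key, record in update_log:   # hypothetical log format
        if op in ("insert", "change"):
            records[key] = record
        elif op == "delete":
            records.pop(key, None)
    return records
```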
That gets us to the concept of a differential file, to try and improve the performance of this simple database system in the face of crashes. The differential file strategy says: don't make any changes to the master file; keep it static. When you get a request to update, you write the new version of the record, or the inserted record, into a brand new file, which we'll call the differential file. So in some sense the differential file keeps the difference between the static old file and what the current state of the file would have been had you been making these changes right in the file.

So now what I have is my old collection of records, and this is what's been backed up in the master file, and in the differential file I keep any newly inserted records or changed versions of old records. The index is the same as it was before: you have an entry for every distinct key, and with each key you keep an address which tells you where the record is. When you start the system, by which I mean just after you've backed up the index and backed up the file, the differential file is going to be empty. Then you get a request to insert: you insert the record into the differential file, you make an insertion into the index, and the associated address will point into the differential file. If you're asked to make a change to a record that is sitting in the master file, you will not change the version of the record that's there; you leave it there. Instead, you change the entry in the index so that it points to some space in the differential file, and you keep the changed version of the record there. If you're doing a search, you look at the index, and the index always points you to the most recent version of the record, so you always find the changed record or the inserted record; you come to the master file only if you're looking for a record that was never changed. So you still perform everything correctly.

Before seeing the merits of a differential file, do you have any questions about how a differential file works? Alright, so you see the merit of this system: when you start, the differential file is small; then you perform some updates and the differential file begins to grow. So at least for some period of time, the differential file is fairly small relative to the master file. What that means is that you can back up the differential file with much greater frequency than you could back up the master file. In turn, what that means is that in case of a failure, the number of transactions since your last backup will be smaller than in the previous mode, because we backed up more frequently. In this mode, whenever you back up the differential file, you also back up the index, because the index is also changing.
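A minimal sketch of the differential-file strategy, with the same simplifications as before (dicts standing in for on-disk structures, and addresses abstracted away):

```python
# The master file is never written after the last integration; all
# new and changed records go to the differential file, and the single
# index always points at the most recent version of each record.

class DifferentialDB:
    def __init__(self, master):
        self.master = dict(master)  # static; this is what's backed up
        self.diff = {}              # differential file (starts empty)
        # one index entry per distinct key, tagged with the location
        self.index = {key: "master" for key in self.master}

    def search(self, key):
        # the index always points to the most recent version
        if self.index[key] == "diff":
            return self.diff[key]
        return self.master[key]

    def upsert(self, key, record):
        # inserts and changes both write only the differential file;
        # an old master version, if any, is left where it is
        self.diff[key] = record
        self.index[key] = "diff"

    def delete(self, key):
        del self.index[key]       # the key vanishes from the index; the
        self.diff.pop(key, None)  # master copy is dropped at integration
```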
Under the assumption that the index is much smaller in size than the file, you have the potential to back these two up far more frequently than you could have afforded to back up the index and the master file in the original scheme.

So the recovery time is reduced, because to do a recovery, you copy the master file and you copy the master index, which is now not all that old. Then, sorry, you don't start with an empty differential file: you start with the backup differential file, the backup index, and the transactions since you did the last backup. The transactions now are expected to be much fewer than in the previous mode, and so it should take much less time to reconstruct the state of the differential file at the time of the crash and to reconstruct the state of the index at the time of the crash. So this process reduces the recovery time. Any questions about why recovery time is reduced?

(Student question.) Yes, right: in both cases we are assuming that the transactions since the last backup are saved somewhere, and the assumption now is that the transaction file is much, much smaller than the index, the master file, and the differential file, so you can always back it up with greater frequency than you can back up the others. In a real situation, what you may have is that you've got the transactions backed up somewhere but maybe you've missed the last few, in which case you're going to have to have some way of recovering what they were; but unless you have all the transactions, there's really no way to recover.

Another thing to note here is that this improvement in recovery time has been achieved with really no loss in performance. Say you've set up your index as a B-tree index, or really whatever kind of index you have, and you compare this scheme to the previous scheme we had, where we didn't have the differential file. You're performing an operation; you always have to search the index in either scheme. The index size may be slightly bigger here... let me think about that. Would it be bigger? Actually, it wouldn't be bigger, because the index is going to have one entry for every distinct key, which is the same here as it was in the previous case. So the index sizes are the same, and the cost of searching the index here is the same as the cost of searching the index before. The index tells you whether to go to the master file or to the differential file, so you pay the cost of searching the index plus one more access, to one file or the other, to get what you need, and that's exactly what you were paying in the previous case. So there's no loss in performance, and when a crash occurs, the recovery time is better. So this is kind of a win-win type thing.

The disadvantage with the scheme, of course, is that our comments kind of assumed that the differential file was going to be small and that we'd be backing it up frequently; but certainly, eventually, the differential file is going to become big.
I mean, if you go on like this forever, the differential file would become bigger than the master file. What that means is that periodically you have to integrate the differential file with the master file and create a new version of the master file.

(Student: could you keep a differential file for the differential file?) Well, I suppose you could just keep going on forever and have layers, but I think eventually, at some point, you want to combine everything. Now, there is a space penalty here, which becomes larger as you go to more versions. The space penalty arises because if you have a record in the master file, you could be keeping the new version of it in the differential file, so you're paying double the space for that record; and if you go to a third level, you could have three copies of it, just more recent versions.

(Student question.) Right, the comment is: when you come to do this integration of the file and the differential file, where do you keep the information that tells you that the old copy in the master file is to be removed and replaced by the new copy in the differential file? If you think about combining these two files together, you would either need to have stored somewhere all of the transactions, say all of the records that were deleted, or you can use the fact that this information is implicitly present in the index, because the deleted keys are now gone from the index. At integration time, what you'd really be doing is creating a clean version of the index and a clean version of the file. Creating a clean version might, for example, mean the following: if the master records are all arranged in ascending order of key, you might first take the differential records and arrange them in ascending order of key, and then merge the two together, removing duplicates; that way, all the old versions of changed records vanish. As for the index, you might reconstruct the whole index, and in that process you could check against the records and remove the ones for which you don't have a key. That would be one way to do it. The other way would, of course, be that you stored all of the transactions somewhere, at least the delete transactions; that way you know which records have gone and you can throw them out, and the changed ones you can always find from the index.

So that, of course, is a lot of work, and that's why it's listed here as a disadvantage: you do have to do the integration, which is going to take some amount of time. The nice thing about the integration is that you can do it in the background; you don't have to bring your whole system to a halt. If you've got the state of the system stored from yesterday, in the background you can be integrating, so that your master file and index look like they did yesterday, and then you can keep going from there. So you do have to perform the integration, and the integration does take time, but it can be done in the background. In the ideal case, following integration, the differential file is empty; but if you're doing the integration in the background, it really isn't, because what you have integrated to isn't the most current state of the system.
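A sketch of that integration step, reusing the DifferentialDB sketch from above. It builds the clean master in ascending key order and relies on the index to drop deleted records, as just described:

```python
# Integration: build a clean master from the old master and the
# differential file. A record survives only if its key is still in
# the index, and the differential version wins over the master one.

def integrate(db):
    new_master = {}
    for key in sorted(db.index):              # ascending key order
        if db.index[key] == "diff":
            new_master[key] = db.diff[key]    # inserted/changed version
        else:
            new_master[key] = db.master[key]  # unchanged old record
    # afterwards the differential file is (ideally) empty again
    return DifferentialDB(new_master)
```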
Now, this way of operating works fine so long as the index isn't too big. We want the index to not be too large because we want to be able to back up the index with the same frequency with which we're backing up the differential file. So if the index is huge, the scheme that we just talked about isn't all that great, because even though it may be feasible to back up the differential file every five minutes, it may not be practical to back up the index every five minutes. So when you're dealing with a large index, you might say: well, if a differential file worked nicely for the file, it ought to work nicely for the index too. So maybe we have a differential index as well as a differential file.

Alright, so when you have a large index, the previous scheme has some disadvantages to it, and so we might think about using a differential index. The strategy is the same: the master index, or at least your working copy of the master index, is never changed; it's exactly a copy of what you've backed up. Then you have a differential index, and that's where you record all changes that have been made with respect to the records. So the master index is now a faithful index into the master file; everybody in it points there. You have a differential file, and it has its own separate index, which keeps track of the records in the differential file.

So things look like this now: you've got your master file, you've got a differential file, you have an index into the differential file (the differential index), and you have another index into the master file. Your backup file is a faithful image of the master file, your backup index is a faithful image of the master index, and the master index only indexes into the master file. You pretty much have to operate in this way: when you get a request, you want to find the most current version of the record for the key, so you need to check the differential index first. If you find the key in the differential index, it points you into the differential file and you get the current version of the record. If you don't find what you're looking for in the differential index, then you search the master index to see if it's in the big file. So by operating in this fashion, you are always guaranteed to find the most current version of the record.

The problem with the scheme is that you pay a performance penalty, because everything you do first has to go through the differential index, which is something you didn't have in the loop before. So the number of disk accesses per operation goes up: you have to search the differential index, and after that, for almost everybody, you'll be coming to the master index anyway. Again, unless a large fraction of the records have changed, you would expect most of your queries to follow the master path; you're not looking for something that's changed, you're looking for something old. So you're going to be paying a performance hit, because now you have to pay the extra price of searching the differential index.
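The lookup order just described might be sketched like this, again with dicts standing in for the on-disk indexes and files:

```python
# Lookup with a differential index: the differential index is probed
# first on every operation, which is the extra cost discussed above.

def search(key, diff_index, diff_file, master_index, master_file):
    addr = diff_index.get(key)
    if addr is not None:
        return diff_file[addr]               # current (changed) version
    return master_file[master_index[key]]    # unchanged record

# Tiny usage example with made-up contents.
master_index = {"a": 0}; master_file = {0: "old record a"}
diff_index = {"b": 0};   diff_file = {0: "new record b"}
print(search("b", diff_index, diff_file, master_index, master_file))
print(search("a", diff_index, diff_file, master_index, master_file))
```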
To overcome this performance hit, we're going to introduce a filter into the scheme. So we're going to put in a filter, somewhere up front (I guess we'll see it in the next slide), which will be used to decide whether you should be going into the differential index or whether you should just be going into the master index directly. This filter, in order to be useful, needs to be small enough that it can be resident in memory, so that there's no disk access associated with searching the filter.

So what we're looking for is something that looks like this. The filter is resident in memory; you get a key and you search the filter. If the filter says yes, that tells you that you're looking for a changed record, and you need to go into the differential index. If the filter says no, then it means that what you're looking for is not a changed record; it's an old one, an original, so you go through the master index. If you had a filter set up like this, then there's no performance hit: if you're looking for an old record, the number of disk accesses is the same as you would have had before; if you're looking for a new record, this differential index, being smaller, would presumably take fewer disk accesses to search, and then you'd make the one additional access to get the record. So if you could operate in this fashion, there would be no performance hit.

But think about what this filter is doing: this filter is doing exactly what the differential index is doing. The differential index takes a key and tells you whether or not the corresponding record is in the differential file, and that's what this filter does; it takes a key and tells you whether or not the corresponding record is in the differential file. So these two are serving an identical function, and if you could build this filter so that it sits in memory, you could certainly build the index so that it sits in memory too. These two are functionally equivalent, and what that tells you is that an ideal filter doesn't exist, at least not one that meets our needs. And so from that, you move from the quest for an ideal filter to the quest for a non-ideal filter, and that gets us to a Bloom filter.

So here's what a Bloom filter does. I've placed a Bloom filter where previously I had that ideal filter. The ideal filter took a key; if it gave you the answer no, that meant there was no record in the differential file with that key, and if it said yes, that meant there is a record in the differential file with that key. A Bloom filter is unable to give you that yes/no type of answer. What the Bloom filter does is this: when it says no, it's accurate; it means there is no record in the differential file with that key. When it doesn't say no, it gives you the answer maybe. So no really means no, and instead of the yes which the ideal filter gives, this fellow gives you a maybe: it says, well, maybe there is a record here with that key. Whenever it says maybe, that's when you have to go and search the differential index, and at that point the index is the real test.
If the index says yes, you get the record from the differential file; if the index says no, you have to come back and go through the master index. So the only time you have a performance hit is when something called a filter error occurs, which is when the filter says maybe and then the differential index says no. If the filter says maybe and the differential index says yes, there's no error; it sent you on the right path. But if the filter says maybe and the index says no, you were erroneously sent down that path; you could have been sent directly to the master index. So we call this a filter error: when you end up following the path indicated by those two black arrows (maybe, then no), you have a filter error.

Alright, so this scheme works, in the sense that you will always access the most current version of the record. The only time you have a performance hit is when you have a filter error; if there are no errors, there's no performance hit. Again, as was the case with the ideal filter, the Bloom filter is going to reside in memory, so there are no disk accesses associated with searching or querying the filter. So the objective then becomes to design a Bloom filter which minimizes the probability of a filter error.

To design a Bloom filter, what we're going to do is grab hold of as much internal memory as the application can spare; let's call that m bits, so the filter is going to be m bits long. The more memory you can grab for the filter, the smaller the chance of a filter error. All of the bits of our filter will be 0 when the differential index is empty, and as the index gets populated, some of these bits will get changed to 1. We're going to employ a collection of hash functions to determine which bits to change to 1 whenever you enter a new key into the differential index. So we'll use h functions; h will become an optimization parameter in a moment, when we try to decide how many we should use. You could use 1, 2, 3, 4, or 5, depending upon the situation. Whenever we insert a key into the differential index, we compute our set of hash functions f1 through fh and set all of the corresponding bits in the filter to 1. So a Bloom filter is just a sequence of m bits coupled with h hash functions. The hash functions need to be relatively independent, so that given a key k, you would like f1(k) to be different from f2(k), different from f3(k), and so on up to fh(k). So the way of operation is: every time you insert a new key into your differential index, you go and set some bits in the Bloom filter to 1, and those bits are obtained by computing the h hash functions that are in use. Given a key k, we refer to the values of those h hash functions as defining the signature of the key.
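A minimal sketch of such a Bloom filter; the "no"/"maybe" return values mirror the two answers described above:

```python
# m bits, all zero while the differential index is empty, plus h
# relatively independent hash functions; the values f1(k)..fh(k)
# form the signature of key k.

class BloomFilter:
    def __init__(self, m, hash_functions):
        self.bits = [0] * m
        self.funcs = hash_functions

    def insert(self, key):
        # set every bit of the key's signature to 1
        for f in self.funcs:
            self.bits[f(key)] = 1

    def query(self, key):
        # any 0 bit in the signature: the key was never inserted
        if all(self.bits[f(key)] for f in self.funcs):
            return "maybe"   # go search the differential index
        return "no"          # go straight to the master index
```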
Let's look at an example of how the filter works. Say you've just done an integration, so the differential file is empty, the differential index is empty, and at that time the Bloom filter is zeroed out. Then you get requests to perform updates: keys get entered into the differential index and records into the differential file, and at that time some of these zeros change to ones. In this example I've got eleven bits; just so you don't think this is representative of the magnitude of m, in typical applications the number of bits you might use would be on the order of millions or tens of millions, so we're looking at a large number of bits to make up one of these filters. I'm assuming for this example that we've got two hash functions; again, depending upon the relationship between m and another parameter that we'll see in a moment, the optimal choice for h could be one, two, three, four, or five. The two hash functions I'm going to use: the first one just takes the key and takes the remainder when divided by m; the other one takes two times the key, mod m.

If you get the key 15, I compute 15 mod 11, which is 4, so bit 4 will be changed to a 1; I compute 30 mod 11, which is 8, so bit 8 will get changed to a 1. The signature for the key 15 is (4, 8), and those two bits get changed to one. If you get the key 17, you compute 17 mod 11, which is a 6, so bit 6 will get changed to a one, and then you compute 34 mod 11, which is a 1, so bit 1 gets changed. So (1, 6) is the signature for 17. So whenever you get a new key, you compute its signature and set the appropriate bits to 1.

Now suppose you get a request to search for the record with a particular key k. You then compute its signature. If any of the bits corresponding to its signature is 0, you know with certainty that the key is not in the differential index; had it been there, that signature would have all bits equal to 1. So if any bit in the signature is 0, you know with certainty it's not in the differential index. On the other hand, if all the bits are 1, you don't know with certainty that the key is in the index; it may be, it may not. For example, suppose k is 6. You compute its signature: 6 mod 11 is 6, and that bit is 1; 12 mod 11 is 1, and that bit is 1. So its signature (1, 6) has all its bits equal to 1, but we know that our differential index only has 15 and 17. So you get a filter error. Filter errors can occur with this scheme, and that's why you have that maybe branch.
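Using the BloomFilter sketch above, the example can be replayed directly:

```python
# The lecture's example: m = 11 bits and two hash functions,
# f1(k) = k mod 11 and f2(k) = 2k mod 11.
bf = BloomFilter(11, [lambda k: k % 11, lambda k: (2 * k) % 11])

bf.insert(15)        # signature (4, 8): 15 mod 11 = 4, 30 mod 11 = 8
bf.insert(17)        # signature (6, 1): 17 mod 11 = 6, 34 mod 11 = 1

print(bf.query(15))  # "maybe": correctly sent to the differential index
print(bf.query(6))   # "maybe": 6 mod 11 = 6 and 12 mod 11 = 1 are both
                     # set, yet 6 was never inserted -- a filter error
print(bf.query(3))   # "no": bit 3 (= 3 mod 11) is still 0
```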
Okay, now let's look at how you design the filter. We need to know m, the number of bits: how big is the filter going to be? This is really not an optimization parameter: we know that the larger m is, the better, so you look at the machine you're going to run the application on and grab hold of as much memory as you can; you want m to be the largest that you can get. We also need to know the number of hash functions, and that is something you can control. If you use a small number of functions, say h equal to 1, then the chance that two keys have the same signature is higher than if you use a larger number of hash functions. So the chances of getting a filter error become greater if the number of hash functions is too small, because different keys will end up with the same signature with a higher probability. On the other hand, if your h is too large, then too many of those bits will get set to 1 very quickly, and that's not good either. So a small h is not good, a large h is not good, something in the middle is, and we need to decide what that something between small and large that we should be using is. We also want to select the functions so that they're relatively independent, which means that when you compute f1, f2, f3, you would like all of these values to be different as much of the time as possible.

Alright, so the question now is how we choose the number of hash functions. The probability of a filter error depends on the filter size m, which is going to be as large as possible, and it depends on the number of hash functions. Another thing we haven't talked about: it depends on how many updates you tolerate before you integrate the whole thing. That controls how big the differential index gets and how big the differential file gets, and therefore it also controls how many ones you get in that Bloom filter. So: how many update requests do you want to allow before you zero out the filter and integrate everything? We assume that that's provided to you. Maybe you decide that you're going to integrate every day, and in a day you get ten thousand updates; or you're going to integrate every two days, and you get a million updates every two days. So we're going to assume that u, the number of updates you can tolerate before integration, is known, and we're going to assume that all of these updates are really to different records; if some of them are to the same record, then your performance is better than the analysis would show. So u is a count of how many inserts, deletes, and changes occur before integration.

So we're going to assume that m and u are known. We're also going to assume that n, the total number of records in your original file, is very large compared to the number of updates you can tolerate between integrations. What that really means is that the total number of records in the system pretty much stays the same: if you take n records and do u inserts, you get n plus u records, but u is very small compared to n, so you still have about n records; if you delete u records from n, well, you still have about n records.

Now, the probability of a filter error is highest just before integration; that's when you've got the maximum number of ones in that filter. So we look at the highest probability of an error, which is just after you've done the last update before integration. It has two components to it, an A and a B. In order to have a filter error, it must be the case that you are requesting a record that has not been changed; if you're requesting a record that has been changed, you go to the differential index, it's going to say yes, and you don't get a filter error. You have to follow that black path, that maybe path.
To follow that black maybe path, there are two links there: for one of those links, you have to be looking for an unmodified record; for the other link, it must be that every bit of the key's signature is one. So one link requires the signature to be all ones, and the other requires that it's an unmodified record, and the filter error probability is the probability of both of these events occurring for a given key.

So you need to compute that. I'm going to skip the computation because we are running out of time, but it's there in your materials: there's a slide that derives the A part, and there's another one that derives the B part. So the probability of an error is A times B, and if you take the expressions derived for A and B, you end up with an equation for the error probability. Then we make use of an approximation which is valid when the number x appearing in it is large; that means we're assuming that n, the total number of records in the original file, is large, and we're assuming that the number of bits in your filter is large. Under those assumptions the approximation is valid, and we can replace the exact expression with a simpler one. Our objective is to minimize the probability of a filter error, so we're going to differentiate with respect to h; h is our optimization parameter, the number of hash functions. You differentiate with respect to h, set the derivative to 0, and you end up with something which says to use about 0.7m/u hash functions. Again, the point where the derivative is 0 could be either a max or a min; but we know from our previous discussion that the maxima are going to be when h is small and when h is large, so you'll have a min somewhere in between, so most likely it's a min. You can verify that by plotting the function: you'll get kind of a bell-shaped curve, upside down with a single valley, which tells us that the point we computed, where the derivative was 0, is in fact a minimum. Alternatively, you can take the second derivative of the function. So the analysis shows that, under suitable assumptions, the optimal choice of h is given by this expression.
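The derivation itself is skipped in the lecture; a sketch of it, following the standard Bloom filter analysis under the stated assumptions (uniform, relatively independent hash functions, n much larger than u, and m large), goes like this:

```latex
% A: probability the requested record is unmodified, (n - u)/n.
% B: probability all h signature bits are 1 after u distinct keys
%    have been inserted, each setting h (roughly random) bits.
\begin{align*}
P(\text{error}) &= A \cdot B
  = \frac{n-u}{n}\left(1-\Bigl(1-\frac{1}{m}\Bigr)^{uh}\right)^{h}
  \approx \left(1-e^{-uh/m}\right)^{h},\\
&\quad\text{using } \Bigl(1-\frac{1}{x}\Bigr)^{x}\approx e^{-1}
  \text{ for large } x \text{ and } \frac{n-u}{n}\approx 1 .\\
\frac{\partial}{\partial h}\ln P(\text{error})
  &= \ln\!\left(1-e^{-uh/m}\right)
   + \frac{uh}{m}\cdot\frac{e^{-uh/m}}{1-e^{-uh/m}} = 0
  \;\Longrightarrow\; e^{-uh/m}=\frac{1}{2}
  \;\Longrightarrow\; h^{*}=\frac{m}{u}\ln 2\approx\frac{0.693\,m}{u}.
\end{align*}
```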
If you plug in some possible values: say you've got a filter with a million bits, and you need to tolerate half a million updates before integration. If you plug those numbers in, you need about 1.4 hash functions, and what that means is you either use one or you use two. If the valley is pretty flat down near the minimum, then since 1.4 is closer to one, you'd probably want to use one; the function value for P is going to be smaller at one than at two. On the other hand, if you have two million bits in your filter and you need to tolerate half a million updates between integrations, you plug in and you get about 2.77, and from the nature of the function we know you would either be using two or three. You can compute the function at h equals 2 and at h equals 3 and see where it's smaller; but if this is a somewhat symmetric thing, then at three you would expect it to be smaller than at two, because 2.77 is closer to three. So you can go through the analysis, making suitable assumptions, and then figure out how many hash functions you should be using to minimize the probability of error. Any questions?

Okay, before we start today, do you have any questions you'd like to ask? Alright, today we're going to look at a new data structure. This one is called the segment tree, and this particular structure is one of the most basic structures used in the solution of computational geometry problems. The field of computational geometry itself is a very interesting field, which deals with solving problems defined on geometric objects in one-dimensional, two-, three-, or higher-dimensional space. So we're talking about geometric objects; for example, you may be talking about points: points in one dimension, two, three, or more.

As an example of such a problem, you may be given a collection of points in three dimensions, which perhaps represents the current locations of aircraft in space, and be asked to find the pair of aircraft that are closest at that particular time. You might use that as a way to decide whether two aircraft have come too close to each other for safety: given a collection of points in three dimensions, find the two that are closest, and then, if that distance is below a threshold, you might signal some kind of an alarm. Or, in two dimensions, you may have a design, say on a piece of sheet metal, where the design indicates a bunch of places where you're going to be drilling holes. Given the locations of these holes, you may be interested in verifying that no two holes are so close that when you try to drill them you will have metal fatigue or a tear in the metal. So again, you're given a collection of points in two dimensions; find the pair that is closest, and if the distance is below a threshold, you would probably declare this design an impractical design.

So people study closest-pair-of-points problems in many dimensions, and they look for the best algorithms for determining the pair of points that are closest; it's kind of a basic problem studied in computational geometry. A related problem is: find the nearest point to a given location, in one, two, three, or higher-dimensional space. So, for example, you may be driving around in a city, and given your coordinates, as measured maybe by the longitude and latitude of your position, you may be interested in finding the nearest gas station. The gas stations would be the known points, and your current location is something that varies; it would be given to a program as input. Using a stored set of gas station locations, you'd need to find the nearest point, the nearest neighbor, to your particular current location. Or find the nearest restaurant to where you are; equivalent types of things. So in terms of point problems, the commonly studied problems are closest pair of points and nearest neighbor to a given point.
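As a concrete statement of the closest-pair problem (not one of the fast algorithms the field is after), a brute-force check for the aircraft example might look like this; the coordinates and the safety threshold are made up for illustration:

```python
# Brute-force closest pair: O(n^2) pairwise comparison. This only
# states the problem precisely; computational geometry studies much
# faster algorithms for the same question.
from itertools import combinations
from math import dist

def closest_pair(points):
    return min(combinations(points, 2), key=lambda pq: dist(*pq))

# Hypothetical aircraft positions (x, y, altitude) and threshold.
aircraft = [(0.0, 0.0, 9.5), (3.0, 4.0, 9.0), (0.1, 0.2, 9.4)]
p, q = closest_pair(aircraft)
if dist(p, q) < 1.0:
    print("too close for safety:", p, q)
```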
In terms of line problems in computational geometry, people again look at questions asked with respect to collections of lines, either lines in one dimension, two, three, or higher.