Transcript

We also have an index; the index is a dictionary. It contains all of the keys that are in our file, and associated with each key we have an address, which tells us where we can find the corresponding record on the disk.

So a typical mode of operating this simple database is: somebody gives you a key and you want to perform a search. You would first search the index; the index tells you where the record is; you then go and get that record from the appropriate place on the disk.

The operations we want to perform: there is the search operation, also called retrieve (given a key, find the record with the corresponding key), and we also want to be able to update our records, where by update we could mean inserting a new record into the collection, making a change to a record, or removing an existing record.

So basically we have a file of records, and the file is augmented with an index structure, which is supposed to improve the efficiency of the operations you want to perform on this file. A simplistic way of operating this system: somebody wants to perform a search operation; you take the key, you search the index, the index tells you where the record is, and then you get the record from the file, the collection of records. So in a search operation you would first search the index and then get the record. The number of disk accesses you would perform is the number needed to search the index, assuming the index is stored on disk; we assume the index itself is very large and is sitting on a disk. So you spend some number of disk accesses to search the index, and then you do one additional access to get the record from the proper place on the disk. Similarly, if you're doing a delete operation, you would first search the index to find out where the record is, then remove the key from the index and remove the record from the file; you need to reclaim the space occupied by the record so that you can later reuse that space. If you want to do an insert, you insert the key into the index and you insert the record into the file. So that's a fairly simple way to operate the system, and it seems like a fairly intuitive way to do it too.

Now, notice that as you're doing updates, the index, this dictionary, changes. If you do an insert, you add a key and an associated address. If you do a delete, you remove a key. If you're just doing a change, you may need to change the address associated with a key, because you may have to move the record from its current place on the disk to some other place: if you're increasing the size of the record, you may have to go to a larger contiguous chunk of space on your disk. So the index changes over time.
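Before moving on, here is a minimal sketch of this simple mode of operation in Python. The dicts simulating the disk and the index are illustration-only assumptions; in the lecture's setting both structures live on disk and every access is expensive.

```python
# Minimal sketch of the simple indexed file. A dict simulates the
# disk (address -> record); in reality every access below costs
# one or more disk reads/writes.

class SimpleDB:
    def __init__(self):
        self.index = {}      # key -> disk address (the dictionary above)
        self.disk = {}       # simulated disk
        self.next_addr = 0

    def search(self, key):
        addr = self.index[key]      # 1. search the index
        return self.disk[addr]      # 2. one more access for the record

    def insert(self, key, record):
        self.disk[self.next_addr] = record   # write the record to disk
        self.index[key] = self.next_addr     # insert the key into the index
        self.next_addr += 1

    def delete(self, key):
        addr = self.index.pop(key)  # remove the key from the index
        del self.disk[addr]         # reclaim the record's space for reuse

    def update(self, key, record):
        # the new version may need a larger contiguous chunk, so the
        # record can move and the index entry must change with it
        self.delete(key)
        self.insert(key, record)
```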
Couple that first observation, that the index changes over time, with a second one, which says that sooner or later this thing is going to crash. It's just a fact of life: at some point in time you're going to have a failure, and in order to recover from failures, you keep backups. So we have to keep backups so that we can recover from a failure. To recover, you make a copy of the backup that you have of the file; we call that backup the master file. So you go to the backup sitting somewhere and you make a copy of it. You also make a copy of the backup index, the master index, sitting somewhere. Of course, once you've made these copies you haven't recovered your state, because the crash probably didn't happen right after the most recent backup of the index and the file.

So once you've made a copy of the master file and a copy of the master index, you haven't restored the state to what it was at the time of the crash, because that backup probably wasn't made just before the crash; it might have been made an hour earlier, a day earlier, a couple of days earlier, whenever you made the last backup. So the backup of the file is out of sync with the file at the time of the crash, and the backup index is out of sync with the index at the time of the crash; after copying, you're still out of sync. In order to get things back to the state they had at the time of the crash, you need to once again process all of the transactions, actually all of the updates, that took place between the time you made the last backup and the time of the crash. Hopefully you've got all of these stored somewhere, so that you can now process all of these updates against the old index and the old file to recover the state that the index and the file had at the time of the crash; we're going to assume you've got all of them stored.

The time required to recover from a crash depends on how big the master file is, because the time to copy it is a function of how big it is, and it also depends on how big the master index is, because the time to copy that is a function of how large it is. But more importantly, the size of these two determines how much time it takes to recover in terms of processing the transactions, and it does so in two different ways. One is direct: if you're doing inserts and deletes into an index, the cost is a function of how large the index is, so in that sense the time to process the transactions is a function of the index size. The other is indirect: the time to process the transactions is a function of the size of the index and the file because, if the index and the file are very large, you can't be backing them up too frequently. If these are so large that it takes you hours to do a backup, then it's unlikely that you're backing them up every five minutes; more likely you're backing up once a day. If it takes you weeks to back up, chances are your last backup is a few weeks old, which means the number of transactions that have accumulated tends to be larger. The accumulation is larger when these files are big than when they are small, because you simply can't back up more frequently, the larger the item you're trying to back up. So the recovery time becomes a function of how large your index and your file are, for two reasons: one, the time to do the copying; two, the time to do the processing, since the number of transactions tends to be larger the larger these files are, because you can't back up large files too frequently.
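A sketch of that recovery procedure, assuming (purely hypothetically) that the updates since the last backup were logged as (op, key, record) triples:

```python
# Sketch of crash recovery: copy the most recent backup, then replay
# every update made between that backup and the crash. The replay
# cost grows with the number of accumulated transactions.

def recover(backup_records, update_log):
    records = dict(backup_records)       # copy the master file/index
    for op, key, record in update_log:   # hypothetical log format
        if op in ("insert", "change"):
            records[key] = record
        elif op == "delete":
            records.pop(key, None)
    return records
```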
That gets us to the concept of a differential file, to try and improve the performance of this simple database system in the face of crashes. The differential file strategy says: don't make any changes to the master file; keep it static. When you get a request to update, you write the new version of the record, or the inserted record, into a brand new file, which we'll call the differential file. So in some sense the differential file keeps the difference between the static old file and what the current state of the file would have been had you been making these changes right in the file.

So now what I have is my old collection of records, and this is what's been backed up in the master file, and in the differential file I keep any newly inserted records or changed versions of old records. The index is the same as it was before: you have an entry for every distinct key, and with each key you keep an address which tells you where the record is. When you start the system, by which I mean just after you've backed up the index and backed up the file, the differential file is going to be empty. Then you get a request to insert: you insert the record into the differential file, you make an insertion into the index, and the associated address will point into the differential file. If you're asked to make a change to a record that is sitting in the master file, you will not change the version of the record that's there; you leave it there. Instead, you change the entry in the index so that it points to some space in the differential file, and you keep the changed version of the record there. If you're doing a search, you look at the index, and the index always points you to the most recent version of the record, so you always find the changed record or the inserted record; you come to the master file only if you're looking for a record that was never changed. So you still perform everything correctly.

Before seeing the merits of a differential file, do you have any questions about how a differential file works? Alright, so you see the merit of this system: when you start, the differential file is small; then you perform some updates and the differential file begins to grow. So at least for some period of time, the differential file is fairly small relative to the master file. What that means is that you can back up the differential file with much greater frequency than you could back up the master file. In turn, what that means is that in case of a failure, the number of transactions since your last backup will be smaller than in the previous mode, because we backed up more frequently. In this mode, whenever you back up the differential file, you also back up the index, because the index is also changing.
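A minimal sketch of the differential-file strategy, with the same simplifications as before (dicts standing in for on-disk structures, and addresses abstracted away):

```python
# The master file is never written after the last integration; all
# new and changed records go to the differential file, and the single
# index always points at the most recent version of each record.

class DifferentialDB:
    def __init__(self, master):
        self.master = dict(master)  # static; this is what's backed up
        self.diff = {}              # differential file (starts empty)
        # one index entry per distinct key, tagged with the location
        self.index = {key: "master" for key in self.master}

    def search(self, key):
        # the index always points to the most recent version
        if self.index[key] == "diff":
            return self.diff[key]
        return self.master[key]

    def upsert(self, key, record):
        # inserts and changes both write only the differential file;
        # an old master version, if any, is left where it is
        self.diff[key] = record
        self.index[key] = "diff"

    def delete(self, key):
        del self.index[key]       # the key vanishes from the index; the
        self.diff.pop(key, None)  # master copy is dropped at integration
```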
Under the assumption that the index is much smaller in size than the file, you have the potential to back these two up far more frequently than you could have afforded to back up the index and the master file in the original scheme.

So the recovery time is reduced, because to do a recovery, you copy the master file and you copy the master index, which is now not all that old. Then, sorry, you don't start with an empty differential file: you start with the backup differential file, the backup index, and the transactions since you did the last backup. The transactions now are expected to be much fewer than in the previous mode, and so it should take much less time to reconstruct the state of the differential file at the time of the crash and to reconstruct the state of the index at the time of the crash. So this process reduces the recovery time. Any questions about why recovery time is reduced?

(Student question.) Yes, right: in both cases we are assuming that the transactions since the last backup are saved somewhere, and the assumption now is that the transaction file is much, much smaller than the index, the master file, and the differential file, so you can always back it up with greater frequency than you can back up the others. In a real situation, what you may have is that you've got the transactions backed up somewhere but maybe you've missed the last few, in which case you're going to have to have some way of recovering what they were; but unless you have all the transactions, there's really no way to recover.

Another thing to note here is that this improvement in recovery time has been achieved with really no loss in performance. Say you've set up your index as a B-tree index, or really whatever kind of index you have, and you compare this scheme to the previous scheme we had, where we didn't have the differential file. You're performing an operation; you always have to search the index in either scheme. The index size may be slightly bigger here... let me think about that. Would it be bigger? Actually, it wouldn't be bigger, because the index is going to have one entry for every distinct key, which is the same here as it was in the previous case. So the index sizes are the same, and the cost of searching the index here is the same as the cost of searching the index before. The index tells you whether to go to the master file or to the differential file, so you pay the cost of searching the index plus one more access, to one file or the other, to get what you need, and that's exactly what you were paying in the previous case. So there's no loss in performance, and when a crash occurs, the recovery time is better. So this is kind of a win-win type thing.

The disadvantage with the scheme, of course, is that our comments kind of assumed that the differential file was going to be small and that we'd be backing it up frequently; but certainly, eventually, the differential file is going to become big.
I mean, if you go on like this forever, the differential file would become bigger than the master file. What that means is that periodically you have to integrate the differential file with the master file and create a new version of the master file.

(Student: could you keep a differential file for the differential file?) Well, I suppose you could just keep going on forever and have layers, but I think eventually, at some point, you want to combine everything. Now, there is a space penalty here, which becomes larger as you go to more versions. The space penalty arises because if you have a record in the master file, you could be keeping the new version of it in the differential file, so you're paying double the space for that record; and if you go to a third level, you could have three copies of it, just more recent versions.

(Student question.) Right, the comment is: when you come to do this integration of the file and the differential file, where do you keep the information that tells you that the old copy in the master file is to be removed and replaced by the new copy in the differential file? If you think about combining these two files together, you would either need to have stored somewhere all of the transactions, say all of the records that were deleted, or you can use the fact that this information is implicitly present in the index, because the deleted keys are now gone from the index. At integration time, what you'd really be doing is creating a clean version of the index and a clean version of the file. Creating a clean version might, for example, mean the following: if the master records are all arranged in ascending order of key, you might first take the differential records and arrange them in ascending order of key, and then merge the two together, removing duplicates; that way, all the old versions of changed records vanish. As for the index, you might reconstruct the whole index, and in that process you could check against the records and remove the ones for which you don't have a key. That would be one way to do it. The other way would, of course, be that you stored all of the transactions somewhere, at least the delete transactions; that way you know which records have gone and you can throw them out, and the changed ones you can always find from the index.

So that, of course, is a lot of work, and that's why it's listed here as a disadvantage: you do have to do the integration, which is going to take some amount of time. The nice thing about the integration is that you can do it in the background; you don't have to bring your whole system to a halt. If you've got the state of the system stored from yesterday, in the background you can be integrating, so that your master file and index look like they did yesterday, and then you can keep going from there. So you do have to perform the integration, and the integration does take time, but it can be done in the background. In the ideal case, following integration, the differential file is empty; but if you're doing the integration in the background, it really isn't, because what you have integrated to isn't the most current state of the system.
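A sketch of that integration step, reusing the DifferentialDB sketch from above. It builds the clean master in ascending key order and relies on the index to drop deleted records, as just described:

```python
# Integration: build a clean master from the old master and the
# differential file. A record survives only if its key is still in
# the index, and the differential version wins over the master one.

def integrate(db):
    new_master = {}
    for key in sorted(db.index):              # ascending key order
        if db.index[key] == "diff":
            new_master[key] = db.diff[key]    # inserted/changed version
        else:
            new_master[key] = db.master[key]  # unchanged old record
    # afterwards the differential file is (ideally) empty again
    return DifferentialDB(new_master)
```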
Now, this way of operating works fine so long as the index isn't too big. We want the index to not be too large because we want to be able to back up the index with the same frequency with which we're backing up the differential file. So if the index is huge, the scheme that we just talked about isn't all that great, because even though it may be feasible to back up the differential file every five minutes, it may not be practical to back up the index every five minutes. So when you're dealing with a large index, you might say: well, if a differential file worked nicely for the file, it ought to work nicely for the index too. So maybe we have a differential index as well as a differential file.

Alright, so when you have a large index, the previous scheme has some disadvantages to it, and so we might think about using a differential index. The strategy is the same: the master index, or at least your working copy of the master index, is never changed; it's exactly a copy of what you've backed up. Then you have a differential index, and that's where you record all changes that have been made with respect to the records. So the master index is now a faithful index into the master file; everybody in it points there. You have a differential file, and it has its own separate index, which keeps track of the records in the differential file.

So things look like this now: you've got your master file, you've got a differential file, you have an index into the differential file (the differential index), and you have another index into the master file. Your backup file is a faithful image of the master file, your backup index is a faithful image of the master index, and the master index only indexes into the master file. You pretty much have to operate in this way: when you get a request, you want to find the most current version of the record for the key, so you need to check the differential index first. If you find the key in the differential index, it points you into the differential file and you get the current version of the record. If you don't find what you're looking for in the differential index, then you search the master index to see if it's in the big file. So by operating in this fashion, you are always guaranteed to find the most current version of the record.

The problem with the scheme is that you pay a performance penalty, because everything you do first has to go through the differential index, which is something you didn't have in the loop before. So the number of disk accesses per operation goes up: you have to search the differential index, and after that, for almost everybody, you'll be coming to the master index anyway. Again, unless a large fraction of the records have changed, you would expect most of your queries to follow the master path; you're not looking for something that's changed, you're looking for something old. So you're going to be paying a performance hit, because now you have to pay the extra price of searching the differential index.
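The lookup order just described might be sketched like this, again with dicts standing in for the on-disk indexes and files:

```python
# Lookup with a differential index: the differential index is probed
# first on every operation, which is the extra cost discussed above.

def search(key, diff_index, diff_file, master_index, master_file):
    addr = diff_index.get(key)
    if addr is not None:
        return diff_file[addr]               # current (changed) version
    return master_file[master_index[key]]    # unchanged record

# Tiny usage example with made-up contents.
master_index = {"a": 0}; master_file = {0: "old record a"}
diff_index = {"b": 0};   diff_file = {0: "new record b"}
print(search("b", diff_index, diff_file, master_index, master_file))
print(search("a", diff_index, diff_file, master_index, master_file))
```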
To overcome this performance hit, we're going to introduce a filter into the scheme. So we're going to put in a filter, somewhere up front (I guess we'll see it in the next slide), which will be used to decide whether you should be going into the differential index or whether you should just be going into the master index directly. This filter, in order to be useful, needs to be small enough that it can be resident in memory, so that there's no disk access associated with searching the filter.

So what we're looking for is something that looks like this. The filter is resident in memory; you get a key and you search the filter. If the filter says yes, that tells you that you're looking for a changed record, and you need to go into the differential index. If the filter says no, then it means that what you're looking for is not a changed record; it's an old one, an original, so you go through the master index. If you had a filter set up like this, then there's no performance hit: if you're looking for an old record, the number of disk accesses is the same as you would have had before; if you're looking for a new record, this differential index, being smaller, would presumably take fewer disk accesses to search, and then you'd make the one additional access to get the record. So if you could operate in this fashion, there would be no performance hit.

But think about what this filter is doing: this filter is doing exactly what the differential index is doing. The differential index takes a key and tells you whether or not the corresponding record is in the differential file, and that's what this filter does; it takes a key and tells you whether or not the corresponding record is in the differential file. So these two are serving an identical function, and if you could build this filter so that it sits in memory, you could certainly build the index so that it sits in memory too. These two are functionally equivalent, and what that tells you is that an ideal filter doesn't exist, at least not one that meets our needs. And so from that, you move from the quest for an ideal filter to the quest for a non-ideal filter, and that gets us to a Bloom filter.

So here's what a Bloom filter does. I've placed a Bloom filter where previously I had that ideal filter. The ideal filter took a key; if it gave you the answer no, that meant there was no record in the differential file with that key, and if it said yes, that meant there is a record in the differential file with that key. A Bloom filter is unable to give you that yes/no type of answer. What the Bloom filter does is this: when it says no, it's accurate; it means there is no record in the differential file with that key. When it doesn't say no, it gives you the answer maybe. So no really means no, and instead of the yes which the ideal filter gives, this fellow gives you a maybe: it says, well, maybe there is a record here with that key. Whenever it says maybe, that's when you have to go and search the differential index, and at that point the index is the real test.
If the index says yes, you get the record from the differential file; if the index says no, you have to come back and go through the master index. So the only time you have a performance hit is when something called a filter error occurs, which is when the filter says maybe and then the differential index says no. If the filter says maybe and the differential index says yes, there's no error; it sent you on the right path. But if the filter says maybe and the index says no, you were erroneously sent down that path; you could have been sent directly to the master index. So we call this a filter error: when you end up following the path indicated by those two black arrows (maybe, then no), you have a filter error.

Alright, so this scheme works, in the sense that you will always access the most current version of the record. The only time you have a performance hit is when you have a filter error; if there are no errors, there's no performance hit. Again, as was the case with the ideal filter, the Bloom filter is going to reside in memory, so there are no disk accesses associated with searching or querying the filter. So the objective then becomes to design a Bloom filter which minimizes the probability of a filter error.

To design a Bloom filter, what we're going to do is grab hold of as much internal memory as the application can spare; let's call that m bits, so the filter is going to be m bits long. The more memory you can grab for the filter, the smaller the chance of a filter error. All of the bits of our filter will be 0 when the differential index is empty, and as the index gets populated, some of these bits will get changed to 1. We're going to employ a collection of hash functions to determine which bits to change to 1 whenever you enter a new key into the differential index. So we'll use h functions; h will become an optimization parameter in a moment, when we try to decide how many we should use. You could use 1, 2, 3, 4, or 5, depending upon the situation. Whenever we insert a key into the differential index, we compute our set of hash functions f1 through fh and set all of the corresponding bits in the filter to 1. So a Bloom filter is just a sequence of m bits coupled with h hash functions. The hash functions need to be relatively independent, so that given a key k, you would like f1(k) to be different from f2(k), different from f3(k), and so on up to fh(k). So the way of operation is: every time you insert a new key into your differential index, you go and set some bits in the Bloom filter to 1, and those bits are obtained by computing the h hash functions that are in use. Given a key k, we refer to the values of those h hash functions as defining the signature of the key.
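A minimal sketch of such a Bloom filter; the "no"/"maybe" return values mirror the two answers described above:

```python
# m bits, all zero while the differential index is empty, plus h
# relatively independent hash functions; the values f1(k)..fh(k)
# form the signature of key k.

class BloomFilter:
    def __init__(self, m, hash_functions):
        self.bits = [0] * m
        self.funcs = hash_functions

    def insert(self, key):
        # set every bit of the key's signature to 1
        for f in self.funcs:
            self.bits[f(key)] = 1

    def query(self, key):
        # any 0 bit in the signature: the key was never inserted
        if all(self.bits[f(key)] for f in self.funcs):
            return "maybe"   # go search the differential index
        return "no"          # go straight to the master index
```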
Let's look at an example of how the filter works. Say you've just done an integration, so the differential file is empty, the differential index is empty, and at that time the Bloom filter is zeroed out. Then you get requests to perform updates: keys get entered into the differential index and records into the differential file, and at that time some of these zeros change to ones. In this example I've got eleven bits; just so you don't think this is representative of the magnitude of m, in typical applications the number of bits you might use would be on the order of millions or tens of millions, so we're looking at a large number of bits to make up one of these filters. I'm assuming for this example that we've got two hash functions; again, depending upon the relationship between m and another parameter that we'll see in a moment, the optimal choice for h could be one, two, three, four, or five. The two hash functions I'm going to use: the first one just takes the key and takes the remainder when divided by m; the other one takes two times the key, mod m.

If you get the key 15, I compute 15 mod 11, which is 4, so bit 4 will be changed to a 1; I compute 30 mod 11, which is 8, so bit 8 will get changed to a 1. The signature for the key 15 is (4, 8), and those two bits get changed to one. If you get the key 17, you compute 17 mod 11, which is a 6, so bit 6 will get changed to a one, and then you compute 34 mod 11, which is a 1, so bit 1 gets changed. So (1, 6) is the signature for 17. So whenever you get a new key, you compute its signature and set the appropriate bits to 1.

Now suppose you get a request to search for the record with a particular key k. You then compute its signature. If any of the bits corresponding to its signature is 0, you know with certainty that the key is not in the differential index; had it been there, that signature would have all bits equal to 1. So if any bit in the signature is 0, you know with certainty it's not in the differential index. On the other hand, if all the bits are 1, you don't know with certainty that the key is in the index; it may be, it may not. For example, suppose k is 6. You compute its signature: 6 mod 11 is 6, and that bit is 1; 12 mod 11 is 1, and that bit is 1. So its signature (1, 6) has all its bits equal to 1, but we know that our differential index only has 15 and 17. So you get a filter error. Filter errors can occur with this scheme, and that's why you have that maybe branch.
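Using the BloomFilter sketch above, the example can be replayed directly:

```python
# The lecture's example: m = 11 bits and two hash functions,
# f1(k) = k mod 11 and f2(k) = 2k mod 11.
bf = BloomFilter(11, [lambda k: k % 11, lambda k: (2 * k) % 11])

bf.insert(15)        # signature (4, 8): 15 mod 11 = 4, 30 mod 11 = 8
bf.insert(17)        # signature (6, 1): 17 mod 11 = 6, 34 mod 11 = 1

print(bf.query(15))  # "maybe": correctly sent to the differential index
print(bf.query(6))   # "maybe": 6 mod 11 = 6 and 12 mod 11 = 1 are both
                     # set, yet 6 was never inserted -- a filter error
print(bf.query(3))   # "no": bit 3 (= 3 mod 11) is still 0
```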
Okay, now let's look at how you design the filter. We need to know m, the number of bits: how big is the filter going to be? This is really not an optimization parameter: we know that the larger m is, the better, so you look at the machine you're going to run the application on and grab hold of as much memory as you can; you want m to be the largest that you can get. We also need to know the number of hash functions, and that is something you can control. If you use a small number of functions, say h equal to 1, then the chance that two keys have the same signature is higher than if you use a larger number of hash functions. So the chances of getting a filter error become greater if the number of hash functions is too small, because different keys will end up with the same signature with a higher probability. On the other hand, if your h is too large, then too many of those bits will get set to 1 very quickly, and that's not good either. So a small h is not good, a large h is not good, something in the middle is, and we need to decide what that something between small and large that we should be using is. We also want to select the functions so that they're relatively independent, which means that when you compute f1, f2, f3, you would like all of these values to be different as much of the time as possible.

Alright, so the question now is how we choose the number of hash functions. The probability of a filter error depends on the filter size m, which is going to be as large as possible, and it depends on the number of hash functions. Another thing we haven't talked about: it depends on how many updates you tolerate before you integrate the whole thing. That controls how big the differential index gets and how big the differential file gets, and therefore it also controls how many ones you get in that Bloom filter. So: how many update requests do you want to allow before you zero out the filter and integrate everything? We assume that that's provided to you. Maybe you decide that you're going to integrate every day, and in a day you get ten thousand updates; or you're going to integrate every two days, and you get a million updates every two days. So we're going to assume that u, the number of updates you can tolerate before integration, is known, and we're going to assume that all of these updates are really to different records; if some of them are to the same record, then your performance is better than the analysis would show. So u is a count of how many inserts, deletes, and changes occur before integration.

So we're going to assume that m and u are known. We're also going to assume that n, the total number of records in your original file, is very large compared to the number of updates you can tolerate between integrations. What that really means is that the total number of records in the system pretty much stays the same: if you take n records and do u inserts, you get n plus u records, but u is very small compared to n, so you still have about n records; if you delete u records from n, well, you still have about n records.

Now, the probability of a filter error is highest just before integration; that's when you've got the maximum number of ones in that filter. So we look at the highest probability of an error, which is just after you've done the last update before integration. It has two components to it, an A and a B. In order to have a filter error, it must be the case that you are requesting a record that has not been changed; if you're requesting a record that has been changed, you go to the differential index, it's going to say yes, and you don't get a filter error. You have to follow that black path, that maybe path.
To follow that black maybe path, there are two links there: for one of those links, you have to be looking for an unmodified record; for the other link, it must be that every bit of the key's signature is one. So one link requires the signature to be all ones, and the other requires that it's an unmodified record, and the filter error probability is the probability of both of these events occurring for a given key.

So you need to compute that. I'm going to skip the computation because we are running out of time, but it's there in your materials: there's a slide that derives the A part, and there's another one that derives the B part. So the probability of an error is A times B, and if you take the expressions derived for A and B, you end up with an equation for the error probability. Then we make use of an approximation which is valid when the number x appearing in it is large; that means we're assuming that n, the total number of records in the original file, is large, and we're assuming that the number of bits in your filter is large. Under those assumptions the approximation is valid, and we can replace the exact expression with a simpler one. Our objective is to minimize the probability of a filter error, so we're going to differentiate with respect to h; h is our optimization parameter, the number of hash functions. You differentiate with respect to h, set the derivative to 0, and you end up with something which says to use about 0.7m/u hash functions. Again, the point where the derivative is 0 could be either a max or a min; but we know from our previous discussion that the maxima are going to be when h is small and when h is large, so you'll have a min somewhere in between, so most likely it's a min. You can verify that by plotting the function: you'll get kind of a bell-shaped curve, upside down with a single valley, which tells us that the point we computed, where the derivative was 0, is in fact a minimum. Alternatively, you can take the second derivative of the function. So the analysis shows that, under suitable assumptions, the optimal choice of h is given by this expression.
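The derivation itself is skipped in the lecture; a sketch of it, following the standard Bloom filter analysis under the stated assumptions (uniform, relatively independent hash functions, n much larger than u, and m large), goes like this:

```latex
% A: probability the requested record is unmodified, (n - u)/n.
% B: probability all h signature bits are 1 after u distinct keys
%    have been inserted, each setting h (roughly random) bits.
\begin{align*}
P(\text{error}) &= A \cdot B
  = \frac{n-u}{n}\left(1-\Bigl(1-\frac{1}{m}\Bigr)^{uh}\right)^{h}
  \approx \left(1-e^{-uh/m}\right)^{h},\\
&\quad\text{using } \Bigl(1-\frac{1}{x}\Bigr)^{x}\approx e^{-1}
  \text{ for large } x \text{ and } \frac{n-u}{n}\approx 1 .\\
\frac{\partial}{\partial h}\ln P(\text{error})
  &= \ln\!\left(1-e^{-uh/m}\right)
   + \frac{uh}{m}\cdot\frac{e^{-uh/m}}{1-e^{-uh/m}} = 0
  \;\Longrightarrow\; e^{-uh/m}=\frac{1}{2}
  \;\Longrightarrow\; h^{*}=\frac{m}{u}\ln 2\approx\frac{0.693\,m}{u}.
\end{align*}
```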
If you plug in some possible values: say you've got a filter with a million bits, and you need to tolerate half a million updates before integration. If you plug those numbers in, you need about 1.4 hash functions, and what that means is you either use one or you use two. If the valley is pretty flat down near the minimum, then since 1.4 is closer to one, you'd probably want to use one; the function value for P is going to be smaller at one than at two. On the other hand, if you have two million bits in your filter and you need to tolerate half a million updates between integrations, you plug in and you get about 2.77, and from the nature of the function we know you would either be using two or three. You can compute the function at h equals 2 and at h equals 3 and see where it's smaller; but if this is a somewhat symmetric thing, then at three you would expect it to be smaller than at two, because 2.77 is closer to three. So you can go through the analysis, making suitable assumptions, and then figure out how many hash functions you should be using to minimize the probability of error. Any questions?

Okay, before we start today, do you have any questions you'd like to ask? Alright, today we're going to look at a new data structure. This one is called the segment tree, and this particular structure is one of the most basic structures used in the solution of computational geometry problems. The field of computational geometry itself is a very interesting field, which deals with solving problems defined on geometric objects in one-dimensional, two-, three-, or higher-dimensional space. So we're talking about geometric objects; for example, you may be talking about points: points in one dimension, two, three, or more.

As an example of such a problem, you may be given a collection of points in three dimensions, which perhaps represents the current locations of aircraft in space, and be asked to find the pair of aircraft that are closest at that particular time. You might use that as a way to decide whether two aircraft have come too close to each other for safety: given a collection of points in three dimensions, find the two that are closest, and then, if that distance is below a threshold, you might signal some kind of an alarm. Or, in two dimensions, you may have a design, say on a piece of sheet metal, where the design indicates a bunch of places where you're going to be drilling holes. Given the locations of these holes, you may be interested in verifying that no two holes are so close that when you try to drill them you will have metal fatigue or a tear in the metal. So again, you're given a collection of points in two dimensions; find the pair that is closest, and if the distance is below a threshold, you would probably declare this design an impractical design.

So people study closest-pair-of-points problems in many dimensions, and they look for the best algorithms for determining the pair of points that are closest; it's kind of a basic problem studied in computational geometry. A related problem is: find the nearest point to a given location, in one, two, three, or higher-dimensional space. So, for example, you may be driving around in a city, and given your coordinates, as measured maybe by the longitude and latitude of your position, you may be interested in finding the nearest gas station. The gas stations would be the known points, and your current location is something that varies; it would be given to a program as input. Using a stored set of gas station locations, you'd need to find the nearest point, the nearest neighbor, to your particular current location. Or find the nearest restaurant to where you are; equivalent types of things. So in terms of point problems, the commonly studied problems are closest pair of points and nearest neighbor to a given point.
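As a concrete statement of the closest-pair problem (not one of the fast algorithms the field is after), a brute-force check for the aircraft example might look like this; the coordinates and the safety threshold are made up for illustration:

```python
# Brute-force closest pair: O(n^2) pairwise comparison. This only
# states the problem precisely; computational geometry studies much
# faster algorithms for the same question.
from itertools import combinations
from math import dist

def closest_pair(points):
    return min(combinations(points, 2), key=lambda pq: dist(*pq))

# Hypothetical aircraft positions (x, y, altitude) and threshold.
aircraft = [(0.0, 0.0, 9.5), (3.0, 4.0, 9.0), (0.1, 0.2, 9.4)]
p, q = closest_pair(aircraft)
if dist(p, q) < 1.0:
    print("too close for safety:", p, q)
```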
In terms of line problems in computational geometry, people again look at questions asked with respect to collections of lines, either lines in one dimension, two, three, or higher.