Transcript

All right, let's resume our discussion of B-trees. In the last lecture we looked at the definition of B-trees, and we also saw how to insert an item into a B-tree. Today what we plan to do is an analysis of the number of disk accesses made during an insert, then take a look at how you do a delete from a B-tree, and then take a look at a suitable data structure to represent a node in a B-tree. Currently we are kind of working with the implicit assumption that B-tree nodes are going to be represented as arrays, but that may not be the best way to represent the node of a B-tree. Do you have any questions on things we've done so far before I do the balance of the B-tree material?

All right, so the things we're going to do today: worst- and average-case accesses during an insert, a look at the delete operation, the analysis of the delete, and a structure for B-tree nodes.

All right, so let's take a quick look at the insert operation. If you're trying to insert, say, a key 14, and if you assume that the whole B-tree is sitting on a disk, then we need to do one access to get the root, one access to get this node, and then an access to get that node. So three read accesses; we'll modify the node, and we'll need to write it back, so a total of four. Three reads and one write: four accesses to insert the 14. If you insert a 2, it's going to come in over here, so again three read accesses on the way down. Then, when you put it in here, we're going to split this node, so we'll write out the new version of this node, we'll write out the new split node, and we have to write out this one because that gets changed. So when a node gets split, you have to write out the old node
and the new version; and then we're going to kind of back up here and do something, and in this case we just have to write this one out. If you insert an 18, that's going to come into this node here, so three read accesses on the way down. This node gets split, so we'll write out two nodes at this level; then this one will get split, so we'll write out two nodes at that level; then this one will get split, so you write out two nodes at that level; and then you create a new root, so you write out one. So whenever there's a split, at the level at which there's a split you do two writes, and then after the last split you do a write at its parent level.

All right, so assume that you have enough memory to hold H nodes, where H is the height of the B-tree. On the way down you read in H nodes, and then on the way back up, when you do the splits, you don't have to read these nodes back in; they're sitting in memory. So: H read accesses on the way down. If the number of splits is s (and you can't have more than H splits, so s is at most H), there are two writes at each level at which there's a split, there are s such levels, and then one write of the parent of the last node that got split. So the total number of accesses is H reads and 2s + 1 write accesses; s cannot be any more than H, so it works out to 3H + 1.

Now, if you're dealing with very large dictionaries and you're working with an M of the order of, say, 200, then your H might be as big as 4, 5, or 6. If you're looking at an H of 5, this works out to 16 disk accesses, which is still a large number of disk accesses to be making. Now, the nice thing is that even though in the worst case you might be making, say, 16 disk accesses when H is 5, on average you don't make a large number of disk accesses.
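The worst-case count above can be checked with a quick calculation. This is a sketch in Python; the function name and the way the number of splits is passed in are mine, not the lecture's:

```python
def insert_disk_accesses(h, s):
    """Disk accesses for an insert into a B-tree of height h when s
    nodes split on the way back up (0 <= s <= h): h reads going down,
    two writes per split level, and one write at the parent of the
    last split (or of the leaf itself when nothing splits)."""
    assert 0 <= s <= h
    return h + (2 * s + 1)

print(insert_disk_accesses(5, 0))  # no splits: h + 1 = 6 accesses
print(insert_disk_accesses(5, 5))  # every level splits: 3h + 1 = 16
```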
Typically, if all you're doing is you start with a dictionary and you keep adding items one at a time, then the average number of disk accesses tends to be rather small. So let's take a look at an analysis under those conditions.

Suppose you start with an empty B-tree and you just do a large number of inserts. Say we insert n items, and when you're done you end up with a B-tree that has P nodes in it. Then the total number of splits that took place across these n inserts is at most P minus 2. If the number of nodes is more than 2, then we know that the root has split at least once, and when the root splits you create 2 nodes; when every other node splits you create only one new node. So if you have P nodes when you're done, the root has split at least once, and that created two new nodes; each other split might have created only one new node, and so the total number of splits is at most P minus 2. That's the total number of splits that took place across all inserts.

Next, the number of pairs that you must have inserted if you have P nodes: the root has got to have at least one item in it, and the remaining nodes must have at least M over two minus one items each, and there are P minus one of those nodes. So you must have at least this many items in your tree, which means you must have done at least that many inserts. So the average number of splits is the total number of splits (we put an upper bound on that: P minus two) divided by the total number of inserts, which must be at least this many. This is really an upper bound on the number of splits per insert. If you look at this, the P minus two over P minus one is always going to be smaller
than one, so I can throw that out and get an even weaker, but simpler, upper bound. If you try this out for the M value we've been playing around with, M of 200, the average number of splits per insert works out to at most one over 99. And what that means is that the average number of disk accesses per insert, under this model of just a long string of inserts, works out to be (let's see, we had H plus 2s plus 1) H plus 2 over 99 plus 1, which is just slightly more than H plus 1.

H plus 1 is the bare minimum you could expect, because every insert requires you to go all the way down to the bottom of the B-tree to make sure you don't already have an item with that key, and then you would do one write after you modify the leaf into which you put it. So you can't do an insert with fewer than H plus 1 disk accesses; H plus 1 is the minimum for the average number of disk accesses for an insert, and the algorithm that we have comes very close to that. So at least from the average standpoint you're doing pretty well. Some inserts may take a long time, they may make many disk accesses, but the average is very small under the model that you just do a long string of inserts.

Let's turn our attention to delete. Yes, a question: why do we need a lower bound on the number of pairs? Well, if I just take n and stick it in over here, P minus 2 over n, it doesn't really tell me anything. By putting in the lower bound I was able to eliminate n, which is really not known to us. I don't know P, I don't know n; all I really know is the order of my B-tree. So I've been able to get rid of all of the unknowns.
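The derivation just given can be checked numerically. This is a sketch (function name and parameters are my own), using the usual minimum of the ceiling of M/2, minus 1, keys per non-root node:

```python
from math import ceil

def avg_insert_accesses_bound(m, h, p):
    """Upper bound on average disk accesses per insert when inserts
    into an initially empty B-tree of order m leave p > 2 nodes:
    at most p - 2 splits ever happen, and p nodes force at least
    1 + (p - 1) * (ceil(m/2) - 1) inserts."""
    t = ceil(m / 2) - 1              # minimum keys in a non-root node
    n_min = 1 + (p - 1) * t          # fewest inserts that yield p nodes
    avg_splits = (p - 2) / n_min     # upper bound on splits per insert
    return h + 2 * avg_splits + 1    # h reads + (2s + 1) writes

# For m = 200 the splits term is below 1/99, so the bound is just
# barely above the unavoidable h + 1 accesses per insert.
print(avg_insert_accesses_bound(200, 5, 1000))
```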
Still, I come up with something that says it's about H plus 1 disk accesses on average.

All right, turning our attention to the delete operation: I'll illustrate how delete works by going through an example using a B-tree of order 3, namely a 2-3 tree, and then from that we can generalize how it works on all B-trees. So let's look at that example, a 2-3 tree. Suppose you want to delete the 8. Now, when you delete from a B-tree, you always transform a deletion from an interior node into a deletion from a leaf. We're really not going to take the 8 out and then try to manipulate the tree to make it look like a B-tree from that state; instead, what I'll do is find a suitable replacement for the 8 from the current B-tree. A suitable replacement for the 8 is either the largest item in the subtree to its left (the 6: if I put the 6 there, the search tree properties are still satisfied) or the smallest item in the subtree to its right (in this case the 9: I could move the 9 up there). That's similar to how you handle a delete from a binary search tree when you're deleting from a degree-2 node. So I could move the 6 up there, or I could move the 9 up there. Suppose we decide to move the 6 up there. Once you move the 6, you're left with a node whose degree is two, with the two external nodes hanging off of it here; you don't have to do anything else. So the first thing is to transform it into a delete from a leaf, and then you only need to consider deletion from a leaf.

Our strategy will be to replace the item with the largest item in the subtree to its left. So, for example, if you are deleting the four, then we'll replace the four with the three, the largest to its left. If you were
deleting the two, we take the largest in the subtree to its left, the 1, and then you're left with the problem of removing from a leaf, because the largest in the subtree to the left will always be in a leaf node. So we really need to consider only deletion from a leaf.

Suppose you're removing the 16. The simplest case is that the leaf currently has more items in it than it needs, and this node here has one more item than it needs, so removing either the 16 or the 17 is the simplest case of delete. In this case all you do is take the item out of this node and write the new version of the node back. So that's the simplest case: the leaf you're taking it out of has more items than it needs.

Next, suppose you're removing from a leaf that doesn't have more than it needs. For example, here we're removing from this 2-node. Once you take an item out of a node that doesn't have more than it needs, the node becomes deficient: it has one less than what it needs. In the context of a 2-3 tree, one less than what it needs means it becomes empty, but in a B-tree of higher order it may not become empty; it just now has M over two minus two instead of M over two minus one items.

So when we take the 17 out of here, the strategy is going to be to try to borrow an item from a nearest sibling. This node has two nearest siblings, one on the right and one on the left. No node can have more than two nearest siblings; you can have one on each side, only one on the left, or only one on the right. This node has only one nearest sibling, on the right; this one has only one nearest sibling, on
the left. We try to borrow from a nearest sibling, and to minimize the worst-case disk accesses, if a node has two nearest siblings you only look at one of the two to see if it has more than what it needs. So we have a strategy which says that a node like this one always looks only at the nearest sibling to the right. All right, so this node looks at the nearest sibling on the right and discovers that that node has more than it needs; if it has more, then you can borrow. To borrow, you can't just take this item, the 30, and put it here, because then you upset the search tree property. Borrowing is always done through the parent: the 30 moves up and the 20 moves down. And that means you've got three nodes that you have to write back; you've got to read the sibling and then write back this node, this node, and that node. So we do a borrow, because we're deleting from a node that has the bare minimum right now; you take the item out, it becomes deficient, and we fix that deficiency by borrowing from a nearest sibling, in this case the nearest right sibling. When you do the borrowing you end up with that configuration: this node had become empty, the 30 moved up, and the in-between key, the 20, moved down.

All right, now suppose you're deleting from a node that has the minimum required. Once you do the delete it becomes deficient; you check its nearest sibling (whichever one you're supposed to check, in this case the one on the right), and that nearest sibling doesn't have anything extra, it's also got the minimum. If you do a borrow it doesn't solve your problem: certainly you can do the borrow, the 40 moves up, the 30 moves down, but
now that node becomes deficient. So when the nearest sibling that you check doesn't have anything extra, instead of borrowing you do a combine. In a combine, the deficient node, the in-between key, and this minimally occupied node all combine into a single node; it's the reverse of a split. So this node, that key, and that node will combine, and you'll get a node down here that has a 30 and a 40 in it. When you do that, the degree of the parent decreases by 1, so from a 3-node it becomes a 2-node. So: we're deleting from a 2-node, you check its nearest sibling, it's not a 3-node, it doesn't have anything extra, and in that case you do a combine. When you do a combine, the deficient node combines with a node which has the absolute minimum number of items, together with the in-between pair from the parent, and you end up with that situation. When you do a combine at this level, at that level only one node has to be written out, because the other node disappeared. So this node gets written out, the new version; it went from a 2-node to a 3-node. Then you check the parent: the parent's degree goes down by 1, so it could have become deficient. In this case it doesn't become deficient; you just write out the non-deficient parent.

Let's try another example: we're going to delete the 30. It's in a node that has extra items, and whenever you're deleting from a leaf that has extra items, you just take the item out and write that node back. So here that 3-node becomes a 2-node, and things look like that.

All right, suppose you want to take out the three. It's in a 2-node, bare minimum occupancy, so once you take out the three it becomes deficient. When a node becomes
deficient, you check a nearest sibling to see if it has excess, and this one's got excess. You take out, in this case, the smallest item, move it up to the parent, and move the in-between key down; that's a borrow. So this node will end up as four, five, six. So: deleting from a 2-node, in this case you check a sibling, it's a 3-node, it's got more than it needs, you do a borrow through the parent, and you end up in that configuration.

All right, suppose you want to remove the six. Once you take it out, the node becomes deficient, and you want to fix the deficiency, so you check a nearest sibling. It only has a nearest sibling on the left, so you don't have a choice; you go check this one, and this node doesn't have anything to give you, it's at minimum occupancy. So in this case you do a combine: the nearest sibling, which is at minimum occupancy, the node that's deficient, and the in-between key all combine to form a single node at this level. So I get a node which contains four and five, and the degree of the parent will drop by one.

All right, now suppose you want to remove the 40. Again, you come out here; it's in a node with the bare minimum, so once you take out the 40 it becomes deficient. You check its nearest sibling, and its nearest sibling is the nine, which has nothing extra, so it's a combine: this node, the in-between key, and the deficient node all combine to form a new node out here, and the degree of the parent decreases by one.

So when you do a combine, the degree of the parent decreases by one, and if the parent becomes deficient as a result, you need to fix that deficiency. In this case the parent has become deficient, and to fix the deficiency I apply the same process I applied at the lower
level: check a nearest sibling, and if the nearest sibling has more than it needs, then you borrow. So, for example, if this had been a 3-node, maybe in here there is a seven; if there's a seven, there's also a subtree hanging off of it. Then I'd move the seven up here, I'd move the eight down here, and this subtree would be transferred to this side. So when you do a borrow at an interior level, it's not just that the key from here moves up; there's a subtree hanging off (which was null in the previous, leaf-level example) and that subtree also moves to this side. In that case you'd have an eight, this would be its right subtree, and the subtree that came from the other side would be its left subtree. So again you try to borrow; in this particular case you can't borrow, and when you can't borrow, you do a combine. So I need to do a combine: the node which is just minimally occupied, the in-between key, and the deficient node combine into a single node, and that decreases the degree of the parent.

So the deficiency has propagated up another level. Then you go up to that level and try to do a borrow, so you check its nearest sibling; in this case there isn't one. If you can do a borrow, you're fine: a borrow terminates the backward process. If you have to do a combine, then the deficiency propagates one more level up. In the worst case, like here, this deficiency propagates all the way up to the root. You want to check the root's sibling to do a borrow, but it doesn't have a sibling; you know you've hit the root, and at this point you just take that root and throw it away, because that root can be deficient only if it has zero keys in it, whereas the requirement for a root is a degree of at least two.
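The bottom-up pass just described (try a borrow; otherwise combine and move up; at the very top, discard an empty root) can be sketched as a small decision routine. Everything here, including the function name and the idea of passing in each level's sibling occupancy, is my own illustration, not code from the lecture:

```python
from math import ceil

def fix_deficiency_plan(sibling_keys, m):
    """Given the key count of the nearest sibling checked at each
    level (starting at the deficient leaf and moving up), return the
    actions of the bottom-up pass in a B-tree of order m. Assumes the
    deficiency would reach the root if every listed sibling is
    minimal; illustrative sketch only."""
    minimum = ceil(m / 2) - 1        # fewest keys a non-root node may hold
    actions = []
    for keys in sibling_keys:
        if keys > minimum:           # sibling has a spare key
            actions.append("borrow") # rotate it through the parent; done
            return actions
        actions.append("combine")    # merge with sibling + in-between key
        # the parent just lost a key; loop again to fix the parent
    actions.append("drop empty root")  # height shrinks by one
    return actions

print(fix_deficiency_plan([2], 3))     # sibling is a 3-node: one borrow
print(fix_deficiency_plan([1, 1], 3))  # everything minimal: root dies
```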
A deficient root, in other words, is an empty root, and that node can be thrown away irrespective of the order of the B-tree you're dealing with. So you throw away the root, and you end up with a B-tree whose height is one less than what you started with. Any questions about how you do a delete? Right: you have the borrow operation, and then you have the combine; the combine is really the inverse of the split that was done during an insert.

All right, so in the general situation, the steps are: if you're deleting from the interior, you transform it into a deletion from a leaf; if the leaf becomes deficient, that triggers a bottom-up borrowing and combining pass. When you do a combine, you will be combining a node that has minimal occupancy, so that's M over 2 minus 1 pairs: this is the node with minimal occupancy, this is the node that's become deficient (it has one less than minimal occupancy), and then you'll have an in-between pair from the parent. For the new node you create, you need to show that it isn't overfull. If you work out the sum of these three things for the cases when M is odd and when M is even, you'll see that in one case it works out to exactly M minus 1 and in the other case it works out to M minus 2, so the combining is feasible.

All right, so the strategy that we went through is guaranteed to work. Are there any questions about that strategy for general B-trees? Now let's look at the number of disk accesses that are needed. The minimum number of disk accesses needed to do a delete: you have to go down level by level to find the leaf from which you're doing the delete, so in this case three read accesses, and then, if you're lucky, you end up at a node that will not become deficient.
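The feasibility claim above (the merged node holds exactly M minus 1 keys when M is odd and M minus 2 when M is even, so it never overflows) is easy to verify across a range of orders. A quick check, assuming the usual minimum of the ceiling of M/2, minus 1, keys per non-root node:

```python
from math import ceil

# Keys in a combined node: deficient node + minimal sibling + the
# in-between key from the parent. The sum must not exceed m - 1.
for m in range(3, 401):
    deficient = ceil(m / 2) - 2       # one below the minimum
    sibling = ceil(m / 2) - 1         # exactly the minimum
    merged = deficient + sibling + 1  # plus the in-between key
    assert merged == (m - 1 if m % 2 == 1 else m - 2)
    assert merged <= m - 1            # the combine always fits
```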
Following the delete, you have to write that one node out, so in this case four is the minimum with which you can do the delete. If you are deleting from the interior, then the minimum would be five, because you'd have to write out this node as well as that one; but if you're deleting from a leaf, the minimum would be four. So the minimum is either H plus one or H plus two, depending upon whether you're deleting from a leaf or from the interior.

Now, if the delete requires you to do a borrow (for example, you delete the three), then in addition to the accesses you made on the way down, you would need to make one more access to read the sibling, and then you'd have to write this node out, write that node out, and that node out. So a borrow requires you to make one read and three write accesses. If you do a combine (say you take the eight out), then you still have to read in the sibling, so one read; you're going to write something out; and then you bump up to the next level. So really the expensive one is the borrow: the borrow is one extra read and three writes, while a combine, if it were to terminate here (though it wouldn't in this case), would be one read and two writes. The borrow is the expensive one, but a borrow can only be done once in a delete; the combine can be done many times.

So if we're looking for the worst case, then, making the same assumptions as before: on the way down you've got to read one node at each level; then, on the way back up, for the worst case we're going to assume that you end up doing a combine at every level, except that when you get to level two you do a borrow, because that borrow will cost you three writes. So on the way up we assume it propagates all the way up.
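Under those worst-case assumptions, the access count adds up as follows. This is a sketch of the tally (the function name is my own):

```python
def worst_delete_accesses(h):
    """Worst-case disk accesses for a delete from a B-tree of height
    h >= 2: a combine at every level on the way back up except a
    borrow at level two."""
    down_reads = h             # one node read per level going down
    sibling_reads = h - 1      # one sibling read per level being fixed
    combine_writes = h - 2     # each combine writes one surviving node
    borrow_writes = 3          # node, sibling, and parent at level two
    return down_reads + sibling_reads + combine_writes + borrow_writes

print(worst_delete_accesses(5))  # 15, i.e. 3h, versus 3h + 1 for insert
```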
So we'll be doing H minus one sibling reads; then there are H minus two writes of combined nodes, because we're doing combines all the way up except at level two, so you've got H minus two of those; and then at level two we'll do the borrow, and there will be three writes at that level. If you add up everything, it's 3H. So it's about the same as an insert: insert was 3H plus 1, and here it's 3H. Any questions about the analysis?

Let's do an average analysis, again making assumptions similar to the last one: you start with a B-tree that has some number of items in it, and then we do a long string of deletes. So it's got n items in it and there are P nodes, and we're going to delete the items one by one until it becomes empty. Our previous analysis said that the number of items has got to be at least this many if you've got P nodes, and from that we can figure out that P is at most this; we just rearrange the terms here and you get this. Then you figure out how many disk accesses you might have made to start from these P nodes and end up with zero. We will assume that every delete you do performs a borrow in addition. As far as borrows go, borrows don't change the number of nodes; combines, though, decrease the number of nodes, and every time we do a combine we decrease the number of nodes by at least one. Now, when you do a combine that goes all the way up to the top, you actually decrease the number of nodes by more than one, because even the old root disappears. But combines decrease the number of nodes by at least one.
So across all of your deletes you can't have more than P minus one combines, because combines decrease the number of nodes, and in the end you do a combine at the top level which throws away the root; the number of combines can be at most P minus one across all n deletes. But the number of borrows could be as many as the number of deletes.

All right, so the total number of disk accesses across all of the deletes: the H plus four here says there are H read accesses on the way down for every delete (even if you're deleting from the interior, there are H reads going down), and then every delete is doing a borrow, and a borrow means you read a sibling and then do three writes. So that takes care of this term. The combines are being done on top of the borrow that we're already charging to each delete, so when you're doing the combines we don't have to count the read of the sibling, because we already charged that to the borrow; the combines are going to be doing two writes each, and there's a total of P minus one combines, so you get two times that. Now, if you want to account for deletes always taking place from the interior, then you probably want to add one more for writing an interior node back, so to be safe you'd make that H plus five; but that's not going to change the average we're going to come up with. So that's the total number of accesses; depending on whether the deletes are all from leaves or all from interiors, make it H plus 4 or H plus 5. The average number of accesses is at most the total number of accesses (that's the worst you can have) divided by the number of deletes that you're performing, which is n.
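Dividing that total through by the number of deletes gives the average. A sketch of the arithmetic (the helper name and the use of the minimum possible n are my own):

```python
from math import ceil

def avg_delete_accesses_bound(m, h, p):
    """Upper bound on average disk accesses per delete when a B-tree
    of order m with p nodes is emptied one delete at a time: every
    delete is charged h reads, a sibling read, and three borrow
    writes (h + 4), plus two writes for each of at most p - 1
    combines, spread over at least n_min deletes."""
    t = ceil(m / 2) - 1
    n_min = 1 + (p - 1) * t                 # fewest items p nodes hold
    total = n_min * (h + 4) + 2 * (p - 1)   # all accesses, worst case
    return total / n_min                    # h + 4 plus a tiny term

# For m = 200 the combine term is under 2/99, so the average is just
# above h + 4 (h + 5 if every delete also rewrites an interior node).
print(avg_delete_accesses_bound(200, 5, 1000))
```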
If you divide this out: the number of nodes, if you assume M is of the order of 200, is small compared to the number of items, so we can forget about that term, and it's just about H plus 4, or H plus 5 if you account for deletes from the interior. So it's not as bad as the 3H would suggest: assuming you're doing a long string of deletes, the per-delete access count is still pretty small. The minimum you could do it in would be H plus 1 if you're deleting from a leaf and H plus 2 if you're deleting from an interior, at least using our strategy to delete from an interior.

Of course, this analysis assumes you get long strings of inserts or deletes, but that may not happen. If you get alternating inserts and deletes, then it's possible that an insert comes and splits a whole bunch of nodes, so you make 3H plus 1 disk accesses, and right after that you get a request to do a delete which recombines all of those nodes, making 3H disk accesses; then you get another insert that splits everybody, and then you get a delete that combines everybody. So the worst that can happen to you is that you get this bad sequence of alternating inserts and deletes that makes you do the maximum work possible. It all depends on what kind of sequences of inserts and deletes you're working with. Any questions about delete?

Right. As I've mentioned earlier, even though much of the discussion we've had has been with respect to disk accesses, B-trees can be used with advantage for internal-memory applications over something like a red-black tree, because the height would be smaller than that of a red-black tree. So if you were using, say, a B-tree of order 8 or 16,
its worst-case height would be less than that of a red-black tree, and if a node of order 8 or 16 fits into one cache line, the number of cache misses would be reduced compared to searching a red-black tree. So you could use low-degree B-trees in internal memory to improve dictionary performance, though most of the time when people talk about B-trees they are talking about external-memory applications.

Let's turn our attention to the structure of a node. Implicit in all of our discussion has been that you're going to take the up to M minus 1 pairs that a node stores and store them sequentially, possibly using an array-type structure, in ascending order of the keys of these M minus 1 pairs. But that may not be the best way to go. To figure out what structure you should use, you need to take a look at the operations you want to perform on a node; once you know the operations, you can decide the structure to use.

All right, so this is the format of a node: a count that tells you how many pairs are being stored in the node, where each P is a pair and each A is a pointer to a subtree. If the node is sitting on a disk, then each pointer is really a disk address (these A's are all disk addresses), and the P's are the dictionary pairs. When you do a search, you're given a key, and you want to search these pairs to find whether there's a match, or to decide whether you should go into the left subtree or the right subtree of a pair. So you want to search for a given key, and if there isn't a match you want to find what you might call the nearest match, and then you can decide whether to go into the left or right subtree of that nearest match. Then, for doing an insert,
you need to take a new pair and insert it into the node so that everything stays in, say, ascending order, or at least searchable order; so you need to be able to insert an item into a sorted collection. Then you need to be able to split. When you do a split, the split is done around the middle key: you find the middle key, and then everything that is smaller than the middle key shows up here and everything bigger than the middle key shows up in the other node. So it's like a three-way split around the split key, which is the middle key. So you need to be able to find the middle pair: given an index (if M is 100, say, the index of the middle pair), you need to find the pair with that index, and once you've found it, you do a three-way split around that pair.

If you have a delete, first, in the simplest case (and in fact in all of them), you need to be able to delete an item with a given key from a collection. Then you may need to do a borrow. To do a borrow, you need to take out either the least item or the greatest item from a collection, depending upon which side you are doing the borrowing from; that item goes into the parent node and replaces something there, so you need to do a replace; and then the replaced item goes into the deficient node and gets inserted there as its largest or smallest item. So we need to be able to do a delete, a replace, and an insert. A combine, finally, is really a three-way join: you have a deficient node, you have the in-between pair, and you have a barely legal node (it's got just the bare minimum), and they have the relationship where one node has all items smaller than the in-between pair and the other node has
All right, so let's see 44:14 what kind of node structure we should 44:15 use. Now, remember these nodes 44:21 are actually going to be sitting out on 44:22 a disk, so if you're thinking about using 44:25 some kind of a pointer structure you 44:27 need to be able to write the pointers out 44:28 and read them back in so that they still make 44:30 sense. So what we'll do is we'll use an 44:35 array, but instead of using language- 44:37 provided pointers we will simulate 44:39 pointers with integers; so the pointers 44:43 really give an index into the array. 44:44 Rather than an absolute memory address or a 44:47 relative memory address, these are 44:49 indexes into an array. Okay, so we simulate 44:53 pointers using integers, and we will have 44:56 an indexed red-black tree built into 44:59 this array. 45:01 Okay, so each node will still have two 45:04 pointers, but these will be numbers 45:05 instead of, say, C++ pointers or Java 45:09 references, and the indexed part just means 45:14 each node keeps a left-size value; if you 45:17 recall, many lectures ago we talked about 45:18 indexed search trees. Okay, so the 45:21 indexing allows you to find the middle 45:23 pair easily; the use of integers as 45:30 pointers allows us to write this whole 45:32 array out as one single disk write and 45:36 then read it back in and still have our 45:38 structure correct in memory. All right, 45:46 so we use simulated pointers. So look at 45:48 the operations: if you want to search a 45:52 node, it's a red-black tree, okay, you can 45:56 search it in logarithmic time; you want 46:00 to find the middle pair, because it's an 46:01 indexed red-black tree you can find that 46:03 in logarithmic time; you want to insert a 46:07 pair, you can do that in logarithmic time; 46:10 you want to delete a pair, you can do 46:12 that in logarithmic time; you want to 46:15 split a B-tree node into two nodes, 46:18 well, the split operation works in 46:20 logarithmic time.
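The simulated-pointer idea can be sketched as below. This is a minimal illustration under stated assumptions: the tree lives in flat parallel arrays, the left/right "pointers" are integer indices, NIL is -1, and each node keeps a left-size count so the k-th pair (e.g. the middle pair) is found by rank. For brevity this is a plain indexed BST; the lecture's structure would add red-black balancing on top. All class and method names here are hypothetical.

```python
# "Simulated pointers": tree nodes live in flat arrays and left/right are
# integer indices into those arrays, so the whole node can be written to
# disk and read back without fixing up any memory addresses.
NIL = -1  # stand-in for a null pointer

class IndexedBST:
    def __init__(self):
        self.key, self.left, self.right, self.left_size = [], [], [], []
        self.root = NIL

    def _new(self, k):
        self.key.append(k); self.left.append(NIL)
        self.right.append(NIL); self.left_size.append(0)
        return len(self.key) - 1   # the "pointer" is just an index

    def insert(self, k):
        if self.root == NIL:
            self.root = self._new(k); return
        i = self.root
        while True:
            if k < self.key[i]:
                self.left_size[i] += 1          # new node goes to the left
                if self.left[i] == NIL:
                    self.left[i] = self._new(k); return
                i = self.left[i]
            else:
                if self.right[i] == NIL:
                    self.right[i] = self._new(k); return
                i = self.right[i]

    def kth(self, k):
        """Return the k-th smallest key (0-based) using left_size."""
        i = self.root
        while i != NIL:
            if k == self.left_size[i]:
                return self.key[i]
            if k < self.left_size[i]:
                i = self.left[i]
            else:
                k -= self.left_size[i] + 1
                i = self.right[i]
        raise IndexError(k)

t = IndexedBST()
for x in [50, 20, 80, 10, 30, 70, 90]:
    t.insert(x)
# the middle of these 7 keys is the one with index 3
assert t.kth(3) == 50
```

Because every field is a plain array of numbers, the whole structure can be serialized with one write and deserialized with one read, which is exactly why language pointers are avoided here.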
And once you do the 46:25 split, when you write the node out, 46:29 the first time you write it 46:30 out you want to be sure that the active 46:34 part of the node represents one of the 46:36 split's subtrees; the second time you're 46:38 going to write the same array out, but 46:40 now the active part of the node is the 46:42 second subtree. Okay, so both are in 46:45 the same physical array. The one slow operation is joining 46:50 two B-tree nodes: even though you can join two red- 46:52 black trees in logarithmic time, in this 46:55 application your join is going to 46:56 actually take order-of-m time, and the 47:01 reason it's going to take order-of-m 47:02 time is because these two B-tree 47:05 nodes are sitting in two different 47:06 arrays in your memory, and you have to 47:09 physically copy the tree from one array 47:11 into the other array, and there's no way 47:14 to do that 47:14 in less than linear time. Okay, so the 47:20 join in this application would run in 47:21 order m because of the copying. Alright, 47:34 so if we make use of indexed red-black 47:36 trees we can get most of the things to 47:37 work very fast, a lot faster than if you 47:40 used arrays: if you use an array, the search 47:42 would work in logarithmic time, finding the middle pair 47:45 would work in order-one time, but the insert 47:48 would take order of m, the delete would take 47:50 order of m, the split would take order of m, 47:53 and the join would take order of m. So you 47:58 can get things to work faster by using 48:00 indexed red-black trees for most of the 48:03 things; this one, the join, runs slower, but most of 48:06 the others run faster. Now, whether you 48:11 really want to go to the trouble of 48:13 using red-black trees or not in 48:15 this application will depend upon how 48:19 much time you expect to save: if it turns 48:23 out that almost all of your time is 48:25 going into reading and writing and the 48:27 internal time is insignificant, well, 48:30 going from insignificant time to 20% of 48:33 insignificant time
doesn't do you a lot 48:35 of good. So even though you get an 48:40 asymptotic improvement, one would need to 48:41 assess whether the saving in time 48:44 is worth the effort of programming 48:46 red-black trees in this application. 48:55 Alright, any questions? Okay, so we stop 49:03 here.
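The "write the same array twice" idea from the node-structure discussion can be sketched as below. This is a hypothetical illustration: it assumes each disk block stores the shared simulated-pointer arrays plus a header recording which root index is the active part, so after a split the one physical array is written out twice, once per subtree. The names `serialize_node` and the JSON format are illustrative choices, not the lecture's.

```python
# After a split, both subtrees still live in one physical array of
# simulated-pointer nodes; the node is written out twice, each time with a
# header saying which root index is the active part of the block.
import json

def serialize_node(arrays, root):
    """One disk block: the shared arrays plus the index of the active root."""
    return json.dumps({"root": root, "arrays": arrays})

# toy node: keys 10, 20, 30 stored with integer left/right pointers
arrays = {"key": [20, 10, 30], "left": [1, -1, -1], "right": [2, -1, -1]}

# after a split around key 20, the subtree rooted at index 1 is one B-tree
# node and the subtree rooted at index 2 is the other -- same array, two writes
block_a = serialize_node(arrays, 1)
block_b = serialize_node(arrays, 2)
assert json.loads(block_a)["root"] == 1
assert json.loads(block_b)["root"] == 2
```

Reading a block back just parses the arrays and follows the stored root index, which is what makes integer pointers survive the round trip to disk.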