Transcript


00:11
[Music]
00:15
okay all right today we're going to
00:20
start looking at a data structure that's
00:24
called a suffix tree I'd like to say
00:27
we're going to be looking at a new data
00:28
structure but actually as we'll see a
00:31
suffix tree is just an application of a
00:35
compressed try not necessarily a
00:37
compressed binary try a suffix trees
00:40
that compress try in which the keys are
00:45
constrained to be the suffixes of some
00:48
given strength okay so you start with a
00:51
given string you extract all of its
00:53
suffixes those become the keys you
00:56
create a compress try for those keys and
00:58
you get something that is known as a
01:00
suffix tree a suffix trees are useful in
01:04
solving a variety of problems that are
01:08
based on strings in an efficient manner
01:11
okay so to describe these problems let's
01:16
first define what we mean by a string
01:21
the substring and a subsequence and
01:24
actually it might help if I don't know
01:28
can you see that or yes not as bright as
01:31
getting brighter okay all right so by a
01:36
string we'll simply mean any sequence of
01:39
characters by a substring will mean some
01:45
part of that string that is obtained by
01:49
lopping off perhaps the left end of s
01:52
and the right end of s so you take some
01:54
contiguous set of characters inside the
01:57
string they'll define a substring okay
02:01
so for example if you have a string
02:04
cater then eight would be a substring
02:08
it's a contiguous set of characters
02:11
taken out of cater another way to
02:13
get it is you've lopped off the left end
02:15
of the string and you've lopped off some
02:18
right portion of the string what's left
02:20
behind is a substring okay the string
02:26
car is not a substring because it does
02:29
not represent any contiguous sequence of
02:32
characters within cater okay
02:34
alternatively you can't obtain it by
02:37
lopping off some left part in this case
02:39
you lop off in no left part and that
02:42
could get you the C and the a but
02:45
there's no right part that you can lop
02:46
off to get the R to come after the a the
02:53
empty string is a substring of s okay so
02:58
a substring is a contiguous set of
03:00
characters taken out of a given string a
03:06
subsequence on the other hand is it's
03:14
like a substring except that the
03:16
characters don't have to be contiguous
03:18
okay so you start with a string you
03:20
throw out any characters that you want
03:22
from there from the left from the right
03:24
from the middle from anywhere you like
03:25
and what you left behind would is a
03:27
substring a sorry a subsequence okay so
03:32
for example with the same starting with
03:35
the same string Kater okay eight is a
03:40
subsequence obtained by throwing out the
03:44
first character the fourth and the sorry
03:48
the first character and the fifth
03:49
character car is also a subsequence
03:54
obtained by throwing out the third and
03:57
fourth characters okay so car is a
04:01
subsequence but it's not a substring the
04:06
empty string is a subsequence so every
04:13
substring is a subsequence but some sub
04:15
sequences are not sub strings or any
04:19
questions about the distinction between
04:20
a substring and a subsequence
04:25
all right let's take a look at the
04:30
string pattern matching problem okay in
04:33
this problem you're given two entities
04:36
one is a string is called that s another
04:40
entity is a pattern let's call that P
04:43
sub I and the objective is to determine
04:49
whether or not P sub I is a substring of
04:52
s okay and in general applications you
05:00
not only want to determine whether or
05:02
not the substring but you'd also like to
05:03
know an occurrence of this substring so
05:07
maybe you want to find the first
05:08
occurrence of any occurrence of the
05:11
substring within the string yes you're
05:16
probably familiar with the
05:18
knuth-morris-pratt algorithm for string
05:22
matching if you're not it doesn't really
05:25
matter we're not going to look at the
05:27
details of it just the complexity right
05:33
this particular algorithm is able to
05:35
determine whether a pattern piece of I
05:37
is a substring of a string s and amount
05:40
of time this linear in the length of s
05:42
and the length of the pattern if you're
05:49
going to perform a large number of
05:51
queries of this type where the string
05:54
the string is constant okay and you're
05:58
just interrogating the string to find
06:00
out whether p1 is a substring P to the
06:04
substring p3 is a substring and so on so
06:08
for example you might do this in a in an
06:12
elementary version of a program designed
06:14
to check for plagiarism okay so your
06:18
string s may be somebody's book take the
06:22
whole thing represented as one long
06:24
string and now you have a paragraph that
06:26
somebody wrote you want to know was it
06:28
taken from this book so that paragraph
06:31
would be the pattern and you want to
06:34
know does it occur anywhere in there
06:35
well is it a substring of the book
06:38
in a more sophisticated application
06:42
you'd also want to look for where
06:45
there's some simple mutations of the
06:46
paragraph or a substring but in a
06:49
simplistic way you want to do an exact
06:51
match does this paragraph occur in this
06:54
book so the paragraph is the pattern and
06:57
the whole book is the string yes and you
07:01
may be doing this thing many times okay
07:04
so you get different patterns coming in
07:06
and you want to check each of these to
07:08
see if they have been taken from a
07:10
particular book so the string s namely
07:13
the book remains stationary it's that
07:16
it's a constant entity and the pattern
07:19
keeps changing in time also in this type
07:23
of application the string is very very
07:26
long compared to the pattern okay so the
07:29
pattern might be three sentences or four
07:31
sentences long the book might be 800
07:33
pages long so so one one place where you
07:39
may want to do the string pattern
07:44
matching problem repeatedly where the
07:47
string remains the same and the pattern
07:49
keeps changing is in this simplistic
07:50
plagiarism tester and so in that case if
07:56
you went the knuth-morris-pratt route
07:58
you would spend time that was n if you
08:01
were doing n searches so you've got n
08:03
different paragraphs you're checking for
08:05
occurrence you would do n times the
08:07
length of the book plus the length of
08:10
the end patterns and as you would expect
08:14
in this particular application this term
08:16
here would be the dominating term
08:18
because the string is much much larger
08:20
than the time okay so the complexity is
08:23
really order of n times the length of
08:26
the book
08:30
as we'll see if you go the suffix tree
08:33
route okay
08:36
the complexity will look like this okay
08:40
so it is the length of the book plus the
08:46
sum of the lengths of the patterns okay
08:48
that this sum here could be small
08:52
relative to this really you're seeing
08:55
maybe a factor of n improvement and
08:58
being the number of patterns even if n
09:05
is very large so that this thing here
09:10
becomes much bigger than that this term
09:12
here still represents a significant
09:13
improvement in runtime over that okay so
09:19
the suffix tree route would give us a
09:20
much faster solution and the difference
09:27
comes about because the
09:28
knuth-morris-pratt algorithm in order to
09:31
perform its search starts by
09:35
pre-processing the query okay so it
09:38
spends an amount of time that's linear
09:40
in the length of the query pattern P sub
09:44
I to pre-process it essentially sets up
09:47
a finite automata for P sub I and then
09:50
uses that finite automata against the
09:53
string s to find out whether or not that
09:55
pattern occurs with an S and that search
10:00
then takes in the amount of time equal
10:02
to the length of s the suffix tree
10:12
solution that we're going to see instead
10:14
of processing the pattern each time it
10:16
is going to process the string once and
10:19
set up a try for that string a suffix
10:23
tree all of the suffixes of that string
10:26
would be taken and a try would be
10:28
created and we'll be able to construct
10:31
that suffix tree in time linear in the
10:33
length of s and then each time you get a
10:37
pattern there's no pre-processing done
10:39
on the pattern you do a raw search in
10:41
the try for that pattern taking an
10:42
amount of time linear in the length of
10:44
the
10:44
and that's how you get the length of s
10:48
plus some of the lengths of P sub I time
10:50
again they'll become clearer when you
10:52
look at the data structure or another
10:56
application of string matching beyond
10:58
the simplistic pleasure and plagiarism
11:01
tester would be if you look at what goes
11:06
on in gene sequence matching okay so you
11:11
have a databank or a database of strings
11:13
these are really gene sequences that
11:15
people know about and you may have
11:19
hundreds of thousands of these varying
11:22
in length from maybe small ones about 20
11:26
going up to large ones which may be
11:27
10,000 characters or more in these
11:32
strings or gene sequences are strings of
11:38
characters limited to a set of four
11:41
characters ATG and F and then basically
11:45
what happens is other researchers in the
11:48
field perhaps develop a gene sequence
11:54
and they'd like to know whether this
11:56
they discovered something new or
11:57
something that exists so you submit the
12:00
sequence you have to a databank of known
12:02
sequences to find out whether or not the
12:06
sequence exists and really what you're
12:08
asking for is not whether this sequence
12:10
exists as a single separate whole entity
12:12
but does this new sequence of what you
12:15
considers and you think maybe a new
12:17
sequence does it occur as a substring of
12:20
a known gene sequence okay so that gives
12:24
you again the substring matching problem
12:26
except that now you're looking for
12:29
whether your given pattern piece of why
12:32
that's the thing you think you've
12:33
discovered as a new gene sequence does
12:36
it occur the substring in any one of a
12:38
hundred thousand known sequences okay so
12:41
it's kind of a bigger problem again
12:44
there that's kind of a simplistic
12:45
version of the problem you actually are
12:49
looking even for close matches but in a
12:54
simplistic version you're looking for an
12:55
exact match
12:57
and in fact there are services available
13:00
where researchers simply every day just
13:04
submit their sequence and maybe half an
13:07
hour later they get back a response
13:08
which says here are ten that are close
13:10
to what you have okay alright so those
13:18
are just two possible applications for
13:22
string pattern matching alright so with
13:27
that as motivation let's take a look at
13:29
the suffix tree data structure as I said
13:34
it's basically a compressed tri edge
13:37
information means that from any node you
13:39
can figure out what characters were
13:40
skipped as you go by edges when you're
13:45
going down the tree the keys in the
13:50
compressed tri are the suffixes of a
13:54
given string s so if you're thinking
13:58
about a book then you would extract all
14:00
of the suffixes of the book and
14:01
certainly in the case of a book you get
14:03
a very large number of suffixes which
14:04
means you're going to be building a huge
14:07
data structure
14:12
all right for our example we're going to
14:15
look at something much smaller just look
14:19
at this 7 character string yes so if I
14:24
had to build a suffix tree for this 7
14:26
character string s the keys that this
14:29
suffix tree would contain would be the 7
14:32
non empty suffixes of sleeper the
14:37
longest one would be the entire thing
14:38
obtained by lopping off nothing from the
14:41
left or the right syrup for the suffixes
14:45
you only lop off from the left hand ok
14:47
so the longest suffix will be sleeper
14:50
the next one would be leaper and then
14:53
you'd get eeper and then a / and / an or
14:59
and all right so we got seven suffixes
15:03
and our compressed try is going to
15:05
contain these seven keys and before
15:11
looking at the compressed right let's
15:14
see why are we interested in building a
15:16
try of suffixes how does it relate to
15:22
pattern matching so the claim is that a
15:26
pattern P sub I is a substring of a
15:29
given string s if and only if that
15:32
pattern is the front part of some suffix
15:36
okay so if we look back here at our
15:41
seven not suffixes okay and I want to
15:47
ask the question is leap a substring of
15:51
sleeper okay so here I can visually see
15:53
that it is at the same time if I look at
15:57
any one occurrence of leap here I know
16:01
that leap is the front part of a suffix
16:03
that begins at this occurrence and goes
16:06
all the way to the end
16:07
okay so leap is a front part of the
16:10
suffix leaper
16:11
similarly eep-eep is a front part of the
16:16
suffix e / p is a front part of the
16:22
suffix /
16:23
leap if it had occurred in sleeper it
16:29
would be the front part of a suffix
16:32
corresponding to the occurrence of leap
16:34
okay
16:35
the fact that leap is not the front part
16:37
of any of the suffixes tells us that
16:40
leap doesn't occur in sleeper same thing
16:45
for peel if peel had been a suffix it
16:48
would be the front part of one of the
16:50
listed sorry if if peel had been a
16:52
substring it would be the front part of
16:55
one of the listed suffixes since it's
16:57
not the front part of one of the listed
16:59
suffixes peel does not occur in sleeper
17:03
ok alright so the the reason we want to
17:06
have a try of suffixes is this
17:08
particular property here which says P
17:11
sub I is a substring of s if and only if
17:14
it is the front part or prefix of some
17:18
suffix of s does everybody agree with
17:22
that statement alright so if once we
17:30
make this observation we tie this with a
17:33
statement I made in an earlier lecture
17:36
which says one application of tries is
17:38
to keyword completion so if you're
17:43
sitting in UNIX and you want to type a
17:45
command you type the first three letters
17:46
and then you can hit a tab button and
17:49
it'll complete it if it's unique that's
17:51
because you can do prefix completion
17:53
using a try okay so I I can do prefix
17:59
searches in a try very quickly is a
18:03
given key the front part of one of the
18:06
key stored in the try that's a very
18:07
quick search to perform and so if I have
18:10
a try of suffixes then I can determine
18:13
if the pattern occurs by simply asking
18:16
is this pattern the front part of one of
18:18
the keys in the try now before looking
18:23
at the try before looking at a sample
18:28
suffix tree let's take a look at what we
18:31
have to do when the last character of s
18:34
the string you want to be searching and
18:37
repeats now if the last character and s
18:41
repeats then we can conclude that when
18:46
you make your list of keys the suffixes
18:48
what one of at least one of these is
18:51
going to be a proper prefix of another
18:55
okay so for example here if you look at
18:58
creeper the last character r appears in
19:01
creeper twice once at the end and once
19:03
somewhere else okay if I look at my list
19:08
of suffixes for creeper then R is a
19:12
proper prefix of Reaper okay and you
19:17
guarantee that will happen this last
19:20
suffix is going to be a proper prefix of
19:23
all of the suffixes where our occurs in
19:28
the string okay so if our red occurred
19:32
at two other places or would be a proper
19:36
prefix of the one beginning here plus
19:38
the one beginning at this other
19:39
occurrence okay so whenever the last
19:44
character of s occurs more than once we
19:48
know that the key is going into our try
19:51
violate a property we stated last time
19:55
which is that no key should be a proper
19:57
prefix of another okay and you can
20:02
convince yourself that that's the only
20:04
time that property is violated when the
20:06
last character occurs more than once
20:15
so to overcome this violation of the
20:18
requirement on tries that no key be a
20:21
proper prefix of another we will add a
20:25
special character to the end of s call
20:28
that the pound sign or amico and since
20:31
the Mokka occurs only once at the right
20:33
end we no longer have this problem here
20:40
okay
20:41
this marker occurs exactly once and so
20:43
you we will not have a violation of the
20:46
try property requirement so really I
20:52
will change creeper to creeper pound
20:54
sign and then build the suffix tree okay
21:04
alright so again you can visually see
21:06
that if you make a creeper pound sign
21:09
these would be your suffixes and now no
21:12
suffix or none of these keys is a proper
21:15
prefix of another key alright so for our
21:23
example I'll take a look at an abstract
21:25
string made from two different
21:29
characters a and B and then I've
21:31
appended the pound sign because B occurs
21:34
more than once in what I started off
21:36
with so the size of the alphabet has
21:41
gone from two characters a and B to
21:42
three a B and pound the try I'm going to
21:47
build is going to sample the key
21:50
character by characters so the radix in
21:53
this case would be a three radix so can
21:58
extract an A and a B and a pound right
22:05
to extract the suffixes from there build
22:08
the compressed tri nodes will have
22:12
three-way branches depending upon
22:14
whether you've extracted an A or B or
22:16
account this is what you end up with the
22:21
numbers outside give you the character
22:23
on which your branching
22:27
so here we're going to branch and
22:29
character one if the first character in
22:31
the keys and a will end up here if it's
22:35
a B you'll end up here and if it's a
22:38
pound sign you end up over there
22:41
here we are branching on character five
22:43
if character five is an a will be here
22:46
if it's a B will be there if it's a
22:48
pound well then I've got two no I don't
22:50
have that subtree here okay here I'm
22:54
branching on character two of your key
22:56
okay if it's an A you're down here if
22:59
it's a B you're here if it's a pound you
23:01
know with that right the edges here have
23:05
been labeled with the first one gives
23:09
you the character that was used to do
23:10
the branching okay the balance of it
23:13
gives you the characters that were
23:14
skipped over okay so this this foot over
23:23
here you're branching on character one
23:24
of the or key so this is character one
23:26
then this is two three and four which
23:28
were skipped and then character five is
23:31
used here for branching okay so this is
23:34
character five of the key than six seven
23:36
eight nine and ten
23:39
this is character five of the key and
23:41
that's character six so when you get
23:44
down here this node here okay
23:47
this element node represents a BBB a
23:51
bbbb town so it represents this whole
23:54
string yes okay this fellow here represe
24:00
by just concatenate concatenated labels
24:03
on the way down okay so this one here
24:05
represents a b b b b pound
24:09
okay so that'll be the fellow that
24:11
begins at this a and goes to the end
24:14
right this string here has as many
24:19
suffixes as i've got yellow or element
24:21
nodes here this guy here represents the
24:25
suffix pound which is just the rightmost
24:27
fellow this one here represents be pound
24:36
all right any questions about this try
24:45
alright if you look at this try if I was
24:48
to actually populate these nodes with
24:51
the keys that they represent I would not
24:54
be able to build the structure in linear
24:56
time
24:56
because the time required would at least
25:01
be the sum of the lengths of the keys
25:02
because I've got to store the keys in
25:04
these nodes okay and if you look at
25:06
suffixes you got one suffix of length
25:08
one one of length two one over length
25:09
three one over length four one over
25:11
length five six seven and so on and this
25:13
was of length n the sum of the lengths
25:16
of the keys being stored is n square
25:18
order of n square and so you'd have to
25:20
spend at least order of n square time
25:21
storing the keys in these nodes
25:28
similarly if you actually stored edge
25:32
labels on these edges you would have to
25:35
spend order of n square time okay so if
25:37
you add up the lengths of all of these
25:39
things it should work out to be about n
25:41
square yet we can construct this whole
25:46
thing in order n because we don't store
25:48
the keys explicitly in the yellow nodes
25:51
nor do we store the skipped over
25:54
characters and the branch character on
25:56
the edges right instead in these yellow
26:03
nodes okay what we're going to do is
26:06
we're going to store an index into the
26:08
string which says the suffix begins at
26:11
position one okay so this one here says
26:13
the key stored here is a suffix that
26:17
begins at position one of the strength
26:19
and since it's a suffix if you know
26:20
where it begins the ending is always
26:22
known it's the rightmost position okay
26:25
this thing here says the key stored here
26:28
is the suffix that begins at position 4
26:30
1 2 3 so it begins here at this B and
26:33
goes all the way this 9 here says that
26:37
the key stored here is the suffix that
26:39
begins at position 9 namely this B and
26:42
then goes all the way okay so we have an
26:45
indexing scheme the element nodes
26:48
actually
26:48
store numbers or indexes into the
26:50
strengths okay so the string is stored
26:52
once using order of n space and then in
26:57
each of these places
26:58
you keep an index into the string and
27:01
from that index you can reconstruct the
27:02
suffix or the key stored in that note
27:11
all right similarly for the edge
27:13
information in each branch node okay
27:19
you keep one of the index is stored in
27:26
that subtree okay basically this is an
27:30
idea we used in an earlier lecture on
27:33
tries where in each branch node you kept
27:36
a pointer to one of the element nodes so
27:38
that you can extract a key and then from
27:41
that key you can figure out which
27:42
characters were skipped over okay so the
27:45
idea is the same here but instead of
27:48
keeping a pointer to one of these nodes
27:50
and then extracting that keep that index
27:53
and then from the index getting the key
27:55
I just take that index and store it
27:57
directly here okay so in inside the
28:01
overall tree one of the index is 1 so
28:04
I've stored that number here logically
28:07
it's equivalent to storing a pointer to
28:08
this node and then from this node being
28:11
able to figure out what the key is okay
28:14
similarly in this node I've stored an
28:17
index corresponding to one of the
28:20
indexes stored in the yellow nodes in
28:22
that tree okay this fellow keeps one of
28:25
the indexes in this tree for its root
28:29
this one keeps one of the indexes two
28:32
for one of the other nodes in its
28:34
subtree so using this index you can get
28:40
to a suffix which is one of the keys
28:43
here and from that suffix you can figure
28:45
out which keys were skipped over or
28:47
which characters okay so for example if
28:51
you're doing a if you're following a
28:54
path so if you come from here to here
28:55
you know that here you're going to use
28:58
character 1 and it has to be an a get
29:01
that from here
29:02
and then you're going to come down here
29:04
where you're going to use character five
29:06
but before using character five you'd
29:08
like to check the characters two three
29:10
and four agree with your search key to
29:14
figure that out I'll use the numbers
29:16
stored out here the node that I reach
29:19
okay in this case it says one okay and
29:22
from here
29:23
okay it doesn't matter whether you get
29:25
to the one or the five because all of
29:27
these have the same characters two three
29:30
and four that's why we skipped over them
29:33
they also have the same character one
29:35
that's why they're in that subtree okay
29:37
so it doesn't matter which of these
29:39
fellows you have information about when
29:41
you get here because we're going to use
29:44
that key to figure out which characters
29:48
you skipped over and all of them have
29:50
the same values for the skipped over
29:52
characters okay so any one of those
29:54
would do so from here I use this number
29:57
one I get here
29:58
this fellow is the one that you branched
30:00
on then the next one gives you the first
30:03
character skipped over this gives you
30:05
the next one skipped over and that gives
30:06
you the third one skipped over okay so
30:08
from here I can extract the information
30:10
that was stored on this edge the first
30:12
one is the character you branched on and
30:14
the next three tell you what you skipped
30:16
over okay you know that you skipped over
30:19
three because you look at the character
30:21
number here is one the character number
30:23
there is five you subtract the two and
30:24
subtract another one and that tells you
30:27
how many you skipped over or if you just
30:29
subtract the two that tells you how many
30:30
characters are on this label okay so
30:36
storing the edge labels explicitly would
30:40
take order n square time and space
30:42
storing the keys of the suffixes
30:44
explicitly would take N squared time and
30:47
space we don't do that we want to have a
30:49
linear time and space construction for
30:53
the structure okay and so in the yellow
30:57
nodes we store indexes into the string
30:59
and in the branch nodes we also store
31:02
indexes into the string from these
31:05
indexes we can reconstruct the keys that
31:07
would otherwise have been stored in the
31:08
yellow nodes we can reconstruct the
31:10
labels that otherwise would have been
31:12
stored on the edges
31:17
okay so we're not going to look at the
31:20
construction algorithm in class if
31:22
you're interested you can check the
31:24
readings on the web
31:26
the readings describe an algorithm that
31:29
runs in order of n times our amount of
31:33
time and being the length of the string
31:35
are being the alphabet size if you're
31:38
looking at gene sequences you got only
31:41
four characters are as four if you're
31:43
going to put on the pound sign then R
31:45
becomes five okay so if R is considered
31:53
to be a constant then this becomes order
31:55
n the same algorithm described on the
32:02
web reading runs an order of an expected
32:04
time if you use a hash table instead
32:06
instead of array nodes there's a
32:12
reference there for to another round
32:14
with them which runs in order n time and
32:17
this is order n even under the
32:20
assumption that R is not a constant all
32:28
right what we want to do is we want to
32:30
look at how we're going to use a
32:32
substring how we're going to use a
32:34
suffix tree to solve string type
32:37
problems okay all right so first let's
32:42
take a look at the problem we started
32:43
with which is the substring matching
32:45
problem okay so the string is known to
32:50
me ahead of time I run my suffix tree
32:54
construction algorithm and I build this
32:55
and although we haven't looked at the
32:58
algorithm we know that it's going to run
33:01
in order of n time okay and it's also
33:04
going to take order of n space and being
33:06
the length of the string now having
33:11
built this in order of n time then I get
33:13
a request to determine whether or not be
33:16
ABB is a substring and if it is tell me
33:19
where it occurs well I start at the top
33:22
and this says branch and character one I
33:24
look at character one which is a B so I
33:26
end up here
33:27
okay then I this fellow says branch in
33:30
character - and character - is an a so
33:33
I'm going to follow this path okay now
33:35
when I'll follow a path instead of just
33:38
falling all the way down to the bottom
33:40
and then determining that you don't have
33:42
a match I'll do partial matches along
33:45
the way okay so here since you didn't
33:48
really skip over any character this
33:49
label is of length 1 and because you
33:53
followed this branch you have to have a
33:54
match okay here this label here is of
33:58
length 1 2 3 4 5 6 so there were 5
34:00
characters that was skipped the a
34:03
corresponds to the branch so you have to
34:05
match with the a but the other 5 you may
34:08
not match with when you follow this so
34:10
as you follow these edges once you get
34:12
here you use the pointer or index stored
34:15
here into this thing to figure out what
34:17
you skipped over and you match the rest
34:20
okay so the B and the a have already
34:22
been matched the B and the a were
34:24
matched because of ranching and then you
34:26
take the rest of it BB and it matches
34:29
this part here okay
34:31
so you know there's a match you finished
34:34
with this string so you're done with
34:36
checking okay so at this point you know
34:38
there's a match and if you want to know
34:41
where it occurs you take the index that
34:43
is stored out here okay because you
34:46
finished in the middle of an edge you
34:47
just take the index that's stored here
34:49
it'll tell you one place where it'll get
34:53
you to one of the yellow nodes in the
34:54
subtree in this case there's only one
34:56
yellow node which is itself but in a
34:59
more general situation you will arrive
35:00
at a branch node and from there you take
35:03
the index stored in there get to one of
35:05
these fellows and read off the
35:07
occurrence of that substring
35:11
let's try another one ABB B a so you
35:15
start out here branch and character 1
35:17
it's an a we're going to be skipping
35:20
over BBB so I do a match find that
35:24
they're equal then I do a branch and 5
35:27
which is an A ok and you end up here
35:31
when you follow an edge you match the
35:34
skipped over characters here there are
35:36
no skipped over characters to do the
35:39
match so again you kind of terminate in
35:40
the middle
35:41
of an edge and any yellow node down
35:44
there is a matching substring if you do
35:53
a BA BA okay you start out here you
35:57
branch and character one you end up here
35:59
you branch on character two you're going
36:01
to follow this link when you follow a
36:04
link you match on the skipped over
36:05
characters so the B match there the a
36:09
match there the next B matches here but
36:11
the a fails to match that B that tells
36:13
you that there is no occurrence of ba PA
36:22
all right the time required to do the
36:25
match is linear in the length of the
36:26
pattern okay so as you go down you look
36:30
at each character in the pattern once
36:31
and based on that you either find a
36:33
match or you do a branch or if it's a
36:35
non match you stop so you can do
36:41
substring matching in time linear in the
36:45
pattern there is an overhead price that
36:50
you pay in the beginning to set up your
36:52
suffix tree but that's linear in the
36:53
length of the string all right any
37:01
questions about how to use a suffix tree
37:03
to do substring matching
37:15
but suppose you wanted to find all
37:17
occurrences of a pattern not just one so
37:26
we'll follow the search process we just
37:28
had for the pattern and let's suppose
37:34
that the search is successful okay if
37:38
the search terminates at the yellow node
37:42
then that's the only occurrence of that
37:44
pattern on the other hand if the search
37:48
terminates at a branch node all yellow
37:50
nodes in that subtree are occurrences
37:53
and they are the only occurrences so for
37:58
example here if you're searching for a
38:01
for B's and a pound sign so we're going
38:07
to start out here branch and character 1
38:10
compare the rest of this or ABV beefs
38:13
all of match you come here and we look
38:15
at character 5 which is a B you're going
38:17
to follow this you have a match with
38:18
being a pound sign you end up here so
38:23
this tells you that the pattern is a
38:25
substring one occurrence of it is at
38:29
position 5 further there are no other
38:32
occurrences ok so you can get all of the
38:38
occurrences there's only one of them and
38:39
it begins at position 5 look at an
38:45
example
38:46
oh I don't think I have a picture for
38:48
this ok so what we I do ok all right so
38:54
let's suppose ok instead you're
38:57
searching for a B okay so you start out
39:00
here you branch in character 1 you match
39:04
over the skipped characters so the
39:06
search finishes really in the middle on
39:09
this edge but you end up eventually at
39:11
this node ok this is a branch node a
39:14
branch node is guaranteed to have more
39:17
than one yellow node in its subtree
39:19
because this is a compressed try so that
39:21
tells us that there's more than one
39:24
occurrence and in fact the number of
39:27
occurrences is
39:28
exactly equal to the number of yellow
39:29
nodes in the subtree or subtree that's
39:32
routed here okay I can get all of the
39:37
occurrences if I augment the structure
39:40
and the time of construction okay so
39:44
what I do is I'm going to augment it by
39:49
linking together all of the element
39:55
nodes or yellow nodes in in order okay
39:59
and in each branch node I'm going to
40:01
keep a pointer to the first yellow node
40:04
in that subtree and to the last yellow
40:06
node in the subtree okay so if I look at
40:11
my so here first thing I'm going to do
40:13
is I want to link all of these yellow
40:15
nodes together okay and they're going to
40:19
be linked together to form a chain link
40:21
together in in order so create this
40:27
chain of yellow nodes in each branch
40:30
node I'm going to keep a pointer to the
40:32
first and last node in this chain for
40:35
that corresponding subtree so if I look
40:41
at this subtree here the first yellow
40:46
node is number 1 and the second yellow
40:49
node is number 5 if you look at the
40:54
overall tree the first yellow node is
40:56
number 1 and the last yellow node is
40:58
number 10 if you look at this fellow
41:01
here first one is number 4 last one is
41:05
number 9 the first one is number 3
41:08
last one is 8 first one is number 2 last
41:13
one is 7 so each branch node keeps track
41:17
of the first and last nodes on the chain
41:19
for its subtree so then when you reach a
41:23
particular branch node suppose you're
41:25
doing a search and the search terminates
41:27
here and you want to report all
41:29
occurrences of B well then you'd use
41:33
this yellow pointer to get to the first
41:35
occurrence okay which is this fellow and
41:37
then you just follow down the chain
41:38
until you reach the last fellow
41:42
you don't have to do any compares if you
41:45
had to do compares your you'd run in
41:47
quadratic time okay
41:50
so because of these pointers no compares
41:54
aren't needed you know that there's a
41:56
match everywhere you just follow down
41:58
the chain in one unit for occurrence you
42:01
can report the occurrence so you get a
42:07
fast scheme to report all occurrences of
42:09
a pattern in a given string the chaining
42:16
and setting up of these pointers is done
42:18
only once okay you want to find the
42:26
longest repeating substring given a
42:31
string and somebody says well find the
42:33
longest string that occurs at least six
42:35
times now find the longest string that
42:37
occurs at least ten times or two times
42:39
or one so whatever okay sounds like a
42:43
formidable problem but turns out exactly
42:46
very simple when you construct the try
42:50
you leap you label each branch node with
42:53
the number of yellow nodes in it and
43:01
then your task becomes to find a branch
43:08
node with a large enough label okay so
43:12
long as you're looking for an M greater
43:13
than one you have to be at a branch node
43:16
because only the branch nodes represent
43:19
occurrences of things that occur more
43:20
than once so you're looking for a label
43:23
for a branch node with a large enough
43:24
label and the charge number on the
43:31
branch node gives you the length of the
43:36
string that is occurring oh that is
43:39
recurring okay take a look at an example
43:42
here okay see the chart number here is
43:46
five okay and what the five tells you is
43:49
that these yellow nodes here represent
43:52
occurrences of a string of length four
43:54
ABB
43:57
the chart number here is 2 and what that
43:59
tells you is the yellow nodes down here
44:01
represent occurrences of a string of
44:03
length 1 namely B the chart number here
44:08
I think was a 3 yeah so that tells you
44:13
that the yellow nodes here represent
44:15
occurrences of a string of length to be
44:17
B okay the count tells you how many
44:21
times it occurs okay so the chart number
44:25
and that the chart number tells you the
44:27
length of the repeating substring and
44:31
the count tells you how many times it
44:33
occurs okay so if you're looking for C I
44:39
guess these are again the counts okay
44:45
all right so if you're looking for the
44:48
longest substring that occurs at least
44:50
twice okay this is a candidate because
44:55
it says a BB B occurs twice this fellow
45:02
here is a candidate because it says that
45:04
B occurs seven times this guy's a
45:07
candidate it says that BB occurs five
45:09
times this guy is a candidate here it
45:12
says that BB B occurs three times so
45:18
I'll traverse my tree looking at the
45:21
branch nodes
45:21
I see which branch node has a count at
45:25
least equal to what I'm looking for and
45:27
then from amongst those I select the one
45:29
with the largest chart number so that
45:35
gives me the longest occurring string
45:37
longest repeat string that repeats at
45:39
least M times okay you can do that in
45:43
order of n time the traversal of the
45:48
tree is going to take order of n time
45:53
okay finally let's look at the longest
45:57
common substring problem we're given two
46:01
strings s and T and you want to find the
46:04
longest string that's common to s and T
46:07
so for example if s is carport and T is
46:10
airport then the longest substring is
46:14
our ports of length 4 so length 5 it
46:17
occurs in s and it also occurs in T the
46:24
longest common subsequence however is of
46:26
length 5 okay
46:28
so these are two different problems
46:29
longest common subsequence and longest
46:31
common substring the longest common
46:35
subsequence problem you've probably see
46:37
in a dynamic programming solution to
46:38
that and I algorithms course kind of a
46:40
standard problem that studied in dynamic
46:42
programming that runs in time which is
46:46
the product of the lengths of s and T
46:49
but the substring problem can be done in
46:51
time linear in the lengths of s and T by
46:55
constructing a suitable suffix tree so
46:59
what you do is you create a new string s
47:03
put a new symbol in between T and then
47:06
our pound sign and you build the suffix
47:12
tree for this new string U and that's
47:16
going to take you time linear in the
47:20
lengths of s and T so in our case for
47:25
carport and airports I would end up
47:27
building this particular string then you
47:32
make the observation that if you have a
47:36
repeating substring it cannot include
47:38
this fellow okay because this only
47:41
occurs once so repeating substrings have
47:47
the property that either they occur
47:50
somewhere here and again here or they
47:53
occur in airports somewhere on the left
47:56
and again on the right or maybe you got
47:59
an occurrence here and an occurrence
48:00
over there what you're interested in is
48:03
an occurrence here and an occurrence
48:05
over there and the longest such fellow
48:08
okay so we're looking for a longest
48:12
repeating substring with the property
48:15
that one occurrence begins to the left
48:20
of the dollar sign and
48:21
other occurrence begins to the right of
48:23
the dollar sign okay and what that means
48:28
is you set up a suffix tree you do a
48:34
traversal and you label the branch nodes
48:37
with the property that if a branch node
48:40
has a yellow node where one begins is on
48:42
the left of the doll and one on the
48:44
right you say this is a likely candidate
48:46
otherwise you exit and say this is not a
48:48
candidate okay so that way you can
48:50
identify which branch nodes are
48:52
potentials for the solution and then you
48:55
want to find the branch node which is
48:57
identified as a potential with the
48:59
longest charge number on it okay so
49:05
again in linear time you can construct
49:08
the suffix tree and perform a traversal
49:10
on the tree labeling the branch node
49:11
suitably and then identifying the one
49:14
with the longest with the largest chart
49:16
number on it and I will give you the
49:18
longest common substring and the whole
49:21
process takes time linear in the lengths
49:24
of the two strings SMT all right so
49:28
those are just some of the applications
49:31
of suffix trees but hopefully we've seen
49:33
enough to see there is kind of a nice
49:34
data structure to solve problems on
49:37
strings okay we'll stop here