Transcript 00:11 [Music] 00:15 okay all right today we're going to 00:20 start looking at a data structure that's 00:24 called a suffix tree I'd like to say 00:27 we're going to be looking at a new data 00:28 structure but actually as we'll see a 00:31 suffix tree is just an application of a 00:35 compressed try not necessarily a 00:37 compressed binary try a suffix trees 00:40 that compress try in which the keys are 00:45 constrained to be the suffixes of some 00:48 given strength okay so you start with a 00:51 given string you extract all of its 00:53 suffixes those become the keys you 00:56 create a compress try for those keys and 00:58 you get something that is known as a 01:00 suffix tree a suffix trees are useful in 01:04 solving a variety of problems that are 01:08 based on strings in an efficient manner 01:11 okay so to describe these problems let's 01:16 first define what we mean by a string 01:21 the substring and a subsequence and 01:24 actually it might help if I don't know 01:28 can you see that or yes not as bright as 01:31 getting brighter okay all right so by a 01:36 string we'll simply mean any sequence of 01:39 characters by a substring will mean some 01:45 part of that string that is obtained by 01:49 lopping off perhaps the left end of s 01:52 and the right end of s so you take some 01:54 contiguous set of characters inside the 01:57 string they'll define a substring okay 02:01 so for example if you have a string 02:04 cater then eight would be a substring 02:08 it's a contiguous set of characters 02:11 taken out of cater another way to 02:13 get it is you've lopped off the left end 02:15 of the string and you've lopped off some 02:18 right portion of the string what's left 02:20 behind is a substring okay the string 02:26 car is not a substring because it does 02:29 not represent any contiguous sequence of 02:32 characters within cater okay 02:34 alternatively you can't obtain it by 02:37 lopping off some left part in this case 02:39 you lop off in no left part and that 02:42 could get you the C and the a but 02:45 there's no right part that you can lop 02:46 off to get the R to come after the a the 02:53 empty string is a substring of s okay so 02:58 a substring is a contiguous set of 03:00 characters taken out of a given string a 03:06 subsequence on the other hand is it's 03:14 like a substring except that the 03:16 characters don't have to be contiguous 03:18 okay so you start with a string you 03:20 throw out any characters that you want 03:22 from there from the left from the right 03:24 from the middle from anywhere you like 03:25 and what you left behind would is a 03:27 substring a sorry a subsequence okay so 03:32 for example with the same starting with 03:35 the same string Kater okay eight is a 03:40 subsequence obtained by throwing out the 03:44 first character the fourth and the sorry 03:48 the first character and the fifth 03:49 character car is also a subsequence 03:54 obtained by throwing out the third and 03:57 fourth characters okay so car is a 04:01 subsequence but it's not a substring the 04:06 empty string is a subsequence so every 04:13 substring is a subsequence but some sub 04:15 sequences are not sub strings or any 04:19 questions about the distinction between 04:20 a substring and a subsequence 04:25 all right let's take a look at the 04:30 string pattern matching problem okay in 04:33 this problem you're given two entities 04:36 one is a string is called that s another 04:40 entity is a pattern let's call that P 04:43 sub I and the objective is to determine 04:49 whether or not P sub I is a substring of 04:52 s okay and in general applications you 05:00 not only want to determine whether or 05:02 not the substring but you'd also like to 05:03 know an occurrence of this substring so 05:07 maybe you want to find the first 05:08 occurrence of any occurrence of the 05:11 substring within the string yes you're 05:16 probably familiar with the 05:18 knuth-morris-pratt algorithm for string 05:22 matching if you're not it doesn't really 05:25 matter we're not going to look at the 05:27 details of it just the complexity right 05:33 this particular algorithm is able to 05:35 determine whether a pattern piece of I 05:37 is a substring of a string s and amount 05:40 of time this linear in the length of s 05:42 and the length of the pattern if you're 05:49 going to perform a large number of 05:51 queries of this type where the string 05:54 the string is constant okay and you're 05:58 just interrogating the string to find 06:00 out whether p1 is a substring P to the 06:04 substring p3 is a substring and so on so 06:08 for example you might do this in a in an 06:12 elementary version of a program designed 06:14 to check for plagiarism okay so your 06:18 string s may be somebody's book take the 06:22 whole thing represented as one long 06:24 string and now you have a paragraph that 06:26 somebody wrote you want to know was it 06:28 taken from this book so that paragraph 06:31 would be the pattern and you want to 06:34 know does it occur anywhere in there 06:35 well is it a substring of the book 06:38 in a more sophisticated application 06:42 you'd also want to look for where 06:45 there's some simple mutations of the 06:46 paragraph or a substring but in a 06:49 simplistic way you want to do an exact 06:51 match does this paragraph occur in this 06:54 book so the paragraph is the pattern and 06:57 the whole book is the string yes and you 07:01 may be doing this thing many times okay 07:04 so you get different patterns coming in 07:06 and you want to check each of these to 07:08 see if they have been taken from a 07:10 particular book so the string s namely 07:13 the book remains stationary it's that 07:16 it's a constant entity and the pattern 07:19 keeps changing in time also in this type 07:23 of application the string is very very 07:26 long compared to the pattern okay so the 07:29 pattern might be three sentences or four 07:31 sentences long the book might be 800 07:33 pages long so so one one place where you 07:39 may want to do the string pattern 07:44 matching problem repeatedly where the 07:47 string remains the same and the pattern 07:49 keeps changing is in this simplistic 07:50 plagiarism tester and so in that case if 07:56 you went the knuth-morris-pratt route 07:58 you would spend time that was n if you 08:01 were doing n searches so you've got n 08:03 different paragraphs you're checking for 08:05 occurrence you would do n times the 08:07 length of the book plus the length of 08:10 the end patterns and as you would expect 08:14 in this particular application this term 08:16 here would be the dominating term 08:18 because the string is much much larger 08:20 than the time okay so the complexity is 08:23 really order of n times the length of 08:26 the book 08:30 as we'll see if you go the suffix tree 08:33 route okay 08:36 the complexity will look like this okay 08:40 so it is the length of the book plus the 08:46 sum of the lengths of the patterns okay 08:48 that this sum here could be small 08:52 relative to this really you're seeing 08:55 maybe a factor of n improvement and 08:58 being the number of patterns even if n 09:05 is very large so that this thing here 09:10 becomes much bigger than that this term 09:12 here still represents a significant 09:13 improvement in runtime over that okay so 09:19 the suffix tree route would give us a 09:20 much faster solution and the difference 09:27 comes about because the 09:28 knuth-morris-pratt algorithm in order to 09:31 perform its search starts by 09:35 pre-processing the query okay so it 09:38 spends an amount of time that's linear 09:40 in the length of the query pattern P sub 09:44 I to pre-process it essentially sets up 09:47 a finite automata for P sub I and then 09:50 uses that finite automata against the 09:53 string s to find out whether or not that 09:55 pattern occurs with an S and that search 10:00 then takes in the amount of time equal 10:02 to the length of s the suffix tree 10:12 solution that we're going to see instead 10:14 of processing the pattern each time it 10:16 is going to process the string once and 10:19 set up a try for that string a suffix 10:23 tree all of the suffixes of that string 10:26 would be taken and a try would be 10:28 created and we'll be able to construct 10:31 that suffix tree in time linear in the 10:33 length of s and then each time you get a 10:37 pattern there's no pre-processing done 10:39 on the pattern you do a raw search in 10:41 the try for that pattern taking an 10:42 amount of time linear in the length of 10:44 the 10:44 and that's how you get the length of s 10:48 plus some of the lengths of P sub I time 10:50 again they'll become clearer when you 10:52 look at the data structure or another 10:56 application of string matching beyond 10:58 the simplistic pleasure and plagiarism 11:01 tester would be if you look at what goes 11:06 on in gene sequence matching okay so you 11:11 have a databank or a database of strings 11:13 these are really gene sequences that 11:15 people know about and you may have 11:19 hundreds of thousands of these varying 11:22 in length from maybe small ones about 20 11:26 going up to large ones which may be 11:27 10,000 characters or more in these 11:32 strings or gene sequences are strings of 11:38 characters limited to a set of four 11:41 characters ATG and F and then basically 11:45 what happens is other researchers in the 11:48 field perhaps develop a gene sequence 11:54 and they'd like to know whether this 11:56 they discovered something new or 11:57 something that exists so you submit the 12:00 sequence you have to a databank of known 12:02 sequences to find out whether or not the 12:06 sequence exists and really what you're 12:08 asking for is not whether this sequence 12:10 exists as a single separate whole entity 12:12 but does this new sequence of what you 12:15 considers and you think maybe a new 12:17 sequence does it occur as a substring of 12:20 a known gene sequence okay so that gives 12:24 you again the substring matching problem 12:26 except that now you're looking for 12:29 whether your given pattern piece of why 12:32 that's the thing you think you've 12:33 discovered as a new gene sequence does 12:36 it occur the substring in any one of a 12:38 hundred thousand known sequences okay so 12:41 it's kind of a bigger problem again 12:44 there that's kind of a simplistic 12:45 version of the problem you actually are 12:49 looking even for close matches but in a 12:54 simplistic version you're looking for an 12:55 exact match 12:57 and in fact there are services available 13:00 where researchers simply every day just 13:04 submit their sequence and maybe half an 13:07 hour later they get back a response 13:08 which says here are ten that are close 13:10 to what you have okay alright so those 13:18 are just two possible applications for 13:22 string pattern matching alright so with 13:27 that as motivation let's take a look at 13:29 the suffix tree data structure as I said 13:34 it's basically a compressed tri edge 13:37 information means that from any node you 13:39 can figure out what characters were 13:40 skipped as you go by edges when you're 13:45 going down the tree the keys in the 13:50 compressed tri are the suffixes of a 13:54 given string s so if you're thinking 13:58 about a book then you would extract all 14:00 of the suffixes of the book and 14:01 certainly in the case of a book you get 14:03 a very large number of suffixes which 14:04 means you're going to be building a huge 14:07 data structure 14:12 all right for our example we're going to 14:15 look at something much smaller just look 14:19 at this 7 character string yes so if I 14:24 had to build a suffix tree for this 7 14:26 character string s the keys that this 14:29 suffix tree would contain would be the 7 14:32 non empty suffixes of sleeper the 14:37 longest one would be the entire thing 14:38 obtained by lopping off nothing from the 14:41 left or the right syrup for the suffixes 14:45 you only lop off from the left hand ok 14:47 so the longest suffix will be sleeper 14:50 the next one would be leaper and then 14:53 you'd get eeper and then a / and / an or 14:59 and all right so we got seven suffixes 15:03 and our compressed try is going to 15:05 contain these seven keys and before 15:11 looking at the compressed right let's 15:14 see why are we interested in building a 15:16 try of suffixes how does it relate to 15:22 pattern matching so the claim is that a 15:26 pattern P sub I is a substring of a 15:29 given string s if and only if that 15:32 pattern is the front part of some suffix 15:36 okay so if we look back here at our 15:41 seven not suffixes okay and I want to 15:47 ask the question is leap a substring of 15:51 sleeper okay so here I can visually see 15:53 that it is at the same time if I look at 15:57 any one occurrence of leap here I know 16:01 that leap is the front part of a suffix 16:03 that begins at this occurrence and goes 16:06 all the way to the end 16:07 okay so leap is a front part of the 16:10 suffix leaper 16:11 similarly eep-eep is a front part of the 16:16 suffix e / p is a front part of the 16:22 suffix / 16:23 leap if it had occurred in sleeper it 16:29 would be the front part of a suffix 16:32 corresponding to the occurrence of leap 16:34 okay 16:35 the fact that leap is not the front part 16:37 of any of the suffixes tells us that 16:40 leap doesn't occur in sleeper same thing 16:45 for peel if peel had been a suffix it 16:48 would be the front part of one of the 16:50 listed sorry if if peel had been a 16:52 substring it would be the front part of 16:55 one of the listed suffixes since it's 16:57 not the front part of one of the listed 16:59 suffixes peel does not occur in sleeper 17:03 ok alright so the the reason we want to 17:06 have a try of suffixes is this 17:08 particular property here which says P 17:11 sub I is a substring of s if and only if 17:14 it is the front part or prefix of some 17:18 suffix of s does everybody agree with 17:22 that statement alright so if once we 17:30 make this observation we tie this with a 17:33 statement I made in an earlier lecture 17:36 which says one application of tries is 17:38 to keyword completion so if you're 17:43 sitting in UNIX and you want to type a 17:45 command you type the first three letters 17:46 and then you can hit a tab button and 17:49 it'll complete it if it's unique that's 17:51 because you can do prefix completion 17:53 using a try okay so I I can do prefix 17:59 searches in a try very quickly is a 18:03 given key the front part of one of the 18:06 key stored in the try that's a very 18:07 quick search to perform and so if I have 18:10 a try of suffixes then I can determine 18:13 if the pattern occurs by simply asking 18:16 is this pattern the front part of one of 18:18 the keys in the try now before looking 18:23 at the try before looking at a sample 18:28 suffix tree let's take a look at what we 18:31 have to do when the last character of s 18:34 the string you want to be searching and 18:37 repeats now if the last character and s 18:41 repeats then we can conclude that when 18:46 you make your list of keys the suffixes 18:48 what one of at least one of these is 18:51 going to be a proper prefix of another 18:55 okay so for example here if you look at 18:58 creeper the last character r appears in 19:01 creeper twice once at the end and once 19:03 somewhere else okay if I look at my list 19:08 of suffixes for creeper then R is a 19:12 proper prefix of Reaper okay and you 19:17 guarantee that will happen this last 19:20 suffix is going to be a proper prefix of 19:23 all of the suffixes where our occurs in 19:28 the string okay so if our red occurred 19:32 at two other places or would be a proper 19:36 prefix of the one beginning here plus 19:38 the one beginning at this other 19:39 occurrence okay so whenever the last 19:44 character of s occurs more than once we 19:48 know that the key is going into our try 19:51 violate a property we stated last time 19:55 which is that no key should be a proper 19:57 prefix of another okay and you can 20:02 convince yourself that that's the only 20:04 time that property is violated when the 20:06 last character occurs more than once 20:15 so to overcome this violation of the 20:18 requirement on tries that no key be a 20:21 proper prefix of another we will add a 20:25 special character to the end of s call 20:28 that the pound sign or amico and since 20:31 the Mokka occurs only once at the right 20:33 end we no longer have this problem here 20:40 okay 20:41 this marker occurs exactly once and so 20:43 you we will not have a violation of the 20:46 try property requirement so really I 20:52 will change creeper to creeper pound 20:54 sign and then build the suffix tree okay 21:04 alright so again you can visually see 21:06 that if you make a creeper pound sign 21:09 these would be your suffixes and now no 21:12 suffix or none of these keys is a proper 21:15 prefix of another key alright so for our 21:23 example I'll take a look at an abstract 21:25 string made from two different 21:29 characters a and B and then I've 21:31 appended the pound sign because B occurs 21:34 more than once in what I started off 21:36 with so the size of the alphabet has 21:41 gone from two characters a and B to 21:42 three a B and pound the try I'm going to 21:47 build is going to sample the key 21:50 character by characters so the radix in 21:53 this case would be a three radix so can 21:58 extract an A and a B and a pound right 22:05 to extract the suffixes from there build 22:08 the compressed tri nodes will have 22:12 three-way branches depending upon 22:14 whether you've extracted an A or B or 22:16 account this is what you end up with the 22:21 numbers outside give you the character 22:23 on which your branching 22:27 so here we're going to branch and 22:29 character one if the first character in 22:31 the keys and a will end up here if it's 22:35 a B you'll end up here and if it's a 22:38 pound sign you end up over there 22:41 here we are branching on character five 22:43 if character five is an a will be here 22:46 if it's a B will be there if it's a 22:48 pound well then I've got two no I don't 22:50 have that subtree here okay here I'm 22:54 branching on character two of your key 22:56 okay if it's an A you're down here if 22:59 it's a B you're here if it's a pound you 23:01 know with that right the edges here have 23:05 been labeled with the first one gives 23:09 you the character that was used to do 23:10 the branching okay the balance of it 23:13 gives you the characters that were 23:14 skipped over okay so this this foot over 23:23 here you're branching on character one 23:24 of the or key so this is character one 23:26 then this is two three and four which 23:28 were skipped and then character five is 23:31 used here for branching okay so this is 23:34 character five of the key than six seven 23:36 eight nine and ten 23:39 this is character five of the key and 23:41 that's character six so when you get 23:44 down here this node here okay 23:47 this element node represents a BBB a 23:51 bbbb town so it represents this whole 23:54 string yes okay this fellow here represe 24:00 by just concatenate concatenated labels 24:03 on the way down okay so this one here 24:05 represents a b b b b pound 24:09 okay so that'll be the fellow that 24:11 begins at this a and goes to the end 24:14 right this string here has as many 24:19 suffixes as i've got yellow or element 24:21 nodes here this guy here represents the 24:25 suffix pound which is just the rightmost 24:27 fellow this one here represents be pound 24:36 all right any questions about this try 24:45 alright if you look at this try if I was 24:48 to actually populate these nodes with 24:51 the keys that they represent I would not 24:54 be able to build the structure in linear 24:56 time 24:56 because the time required would at least 25:01 be the sum of the lengths of the keys 25:02 because I've got to store the keys in 25:04 these nodes okay and if you look at 25:06 suffixes you got one suffix of length 25:08 one one of length two one over length 25:09 three one over length four one over 25:11 length five six seven and so on and this 25:13 was of length n the sum of the lengths 25:16 of the keys being stored is n square 25:18 order of n square and so you'd have to 25:20 spend at least order of n square time 25:21 storing the keys in these nodes 25:28 similarly if you actually stored edge 25:32 labels on these edges you would have to 25:35 spend order of n square time okay so if 25:37 you add up the lengths of all of these 25:39 things it should work out to be about n 25:41 square yet we can construct this whole 25:46 thing in order n because we don't store 25:48 the keys explicitly in the yellow nodes 25:51 nor do we store the skipped over 25:54 characters and the branch character on 25:56 the edges right instead in these yellow 26:03 nodes okay what we're going to do is 26:06 we're going to store an index into the 26:08 string which says the suffix begins at 26:11 position one okay so this one here says 26:13 the key stored here is a suffix that 26:17 begins at position one of the strength 26:19 and since it's a suffix if you know 26:20 where it begins the ending is always 26:22 known it's the rightmost position okay 26:25 this thing here says the key stored here 26:28 is the suffix that begins at position 4 26:30 1 2 3 so it begins here at this B and 26:33 goes all the way this 9 here says that 26:37 the key stored here is the suffix that 26:39 begins at position 9 namely this B and 26:42 then goes all the way okay so we have an 26:45 indexing scheme the element nodes 26:48 actually 26:48 store numbers or indexes into the 26:50 strengths okay so the string is stored 26:52 once using order of n space and then in 26:57 each of these places 26:58 you keep an index into the string and 27:01 from that index you can reconstruct the 27:02 suffix or the key stored in that note 27:11 all right similarly for the edge 27:13 information in each branch node okay 27:19 you keep one of the index is stored in 27:26 that subtree okay basically this is an 27:30 idea we used in an earlier lecture on 27:33 tries where in each branch node you kept 27:36 a pointer to one of the element nodes so 27:38 that you can extract a key and then from 27:41 that key you can figure out which 27:42 characters were skipped over okay so the 27:45 idea is the same here but instead of 27:48 keeping a pointer to one of these nodes 27:50 and then extracting that keep that index 27:53 and then from the index getting the key 27:55 I just take that index and store it 27:57 directly here okay so in inside the 28:01 overall tree one of the index is 1 so 28:04 I've stored that number here logically 28:07 it's equivalent to storing a pointer to 28:08 this node and then from this node being 28:11 able to figure out what the key is okay 28:14 similarly in this node I've stored an 28:17 index corresponding to one of the 28:20 indexes stored in the yellow nodes in 28:22 that tree okay this fellow keeps one of 28:25 the indexes in this tree for its root 28:29 this one keeps one of the indexes two 28:32 for one of the other nodes in its 28:34 subtree so using this index you can get 28:40 to a suffix which is one of the keys 28:43 here and from that suffix you can figure 28:45 out which keys were skipped over or 28:47 which characters okay so for example if 28:51 you're doing a if you're following a 28:54 path so if you come from here to here 28:55 you know that here you're going to use 28:58 character 1 and it has to be an a get 29:01 that from here 29:02 and then you're going to come down here 29:04 where you're going to use character five 29:06 but before using character five you'd 29:08 like to check the characters two three 29:10 and four agree with your search key to 29:14 figure that out I'll use the numbers 29:16 stored out here the node that I reach 29:19 okay in this case it says one okay and 29:22 from here 29:23 okay it doesn't matter whether you get 29:25 to the one or the five because all of 29:27 these have the same characters two three 29:30 and four that's why we skipped over them 29:33 they also have the same character one 29:35 that's why they're in that subtree okay 29:37 so it doesn't matter which of these 29:39 fellows you have information about when 29:41 you get here because we're going to use 29:44 that key to figure out which characters 29:48 you skipped over and all of them have 29:50 the same values for the skipped over 29:52 characters okay so any one of those 29:54 would do so from here I use this number 29:57 one I get here 29:58 this fellow is the one that you branched 30:00 on then the next one gives you the first 30:03 character skipped over this gives you 30:05 the next one skipped over and that gives 30:06 you the third one skipped over okay so 30:08 from here I can extract the information 30:10 that was stored on this edge the first 30:12 one is the character you branched on and 30:14 the next three tell you what you skipped 30:16 over okay you know that you skipped over 30:19 three because you look at the character 30:21 number here is one the character number 30:23 there is five you subtract the two and 30:24 subtract another one and that tells you 30:27 how many you skipped over or if you just 30:29 subtract the two that tells you how many 30:30 characters are on this label okay so 30:36 storing the edge labels explicitly would 30:40 take order n square time and space 30:42 storing the keys of the suffixes 30:44 explicitly would take N squared time and 30:47 space we don't do that we want to have a 30:49 linear time and space construction for 30:53 the structure okay and so in the yellow 30:57 nodes we store indexes into the string 30:59 and in the branch nodes we also store 31:02 indexes into the string from these 31:05 indexes we can reconstruct the keys that 31:07 would otherwise have been stored in the 31:08 yellow nodes we can reconstruct the 31:10 labels that otherwise would have been 31:12 stored on the edges 31:17 okay so we're not going to look at the 31:20 construction algorithm in class if 31:22 you're interested you can check the 31:24 readings on the web 31:26 the readings describe an algorithm that 31:29 runs in order of n times our amount of 31:33 time and being the length of the string 31:35 are being the alphabet size if you're 31:38 looking at gene sequences you got only 31:41 four characters are as four if you're 31:43 going to put on the pound sign then R 31:45 becomes five okay so if R is considered 31:53 to be a constant then this becomes order 31:55 n the same algorithm described on the 32:02 web reading runs an order of an expected 32:04 time if you use a hash table instead 32:06 instead of array nodes there's a 32:12 reference there for to another round 32:14 with them which runs in order n time and 32:17 this is order n even under the 32:20 assumption that R is not a constant all 32:28 right what we want to do is we want to 32:30 look at how we're going to use a 32:32 substring how we're going to use a 32:34 suffix tree to solve string type 32:37 problems okay all right so first let's 32:42 take a look at the problem we started 32:43 with which is the substring matching 32:45 problem okay so the string is known to 32:50 me ahead of time I run my suffix tree 32:54 construction algorithm and I build this 32:55 and although we haven't looked at the 32:58 algorithm we know that it's going to run 33:01 in order of n time okay and it's also 33:04 going to take order of n space and being 33:06 the length of the string now having 33:11 built this in order of n time then I get 33:13 a request to determine whether or not be 33:16 ABB is a substring and if it is tell me 33:19 where it occurs well I start at the top 33:22 and this says branch and character one I 33:24 look at character one which is a B so I 33:26 end up here 33:27 okay then I this fellow says branch in 33:30 character - and character - is an a so 33:33 I'm going to follow this path okay now 33:35 when I'll follow a path instead of just 33:38 falling all the way down to the bottom 33:40 and then determining that you don't have 33:42 a match I'll do partial matches along 33:45 the way okay so here since you didn't 33:48 really skip over any character this 33:49 label is of length 1 and because you 33:53 followed this branch you have to have a 33:54 match okay here this label here is of 33:58 length 1 2 3 4 5 6 so there were 5 34:00 characters that was skipped the a 34:03 corresponds to the branch so you have to 34:05 match with the a but the other 5 you may 34:08 not match with when you follow this so 34:10 as you follow these edges once you get 34:12 here you use the pointer or index stored 34:15 here into this thing to figure out what 34:17 you skipped over and you match the rest 34:20 okay so the B and the a have already 34:22 been matched the B and the a were 34:24 matched because of ranching and then you 34:26 take the rest of it BB and it matches 34:29 this part here okay 34:31 so you know there's a match you finished 34:34 with this string so you're done with 34:36 checking okay so at this point you know 34:38 there's a match and if you want to know 34:41 where it occurs you take the index that 34:43 is stored out here okay because you 34:46 finished in the middle of an edge you 34:47 just take the index that's stored here 34:49 it'll tell you one place where it'll get 34:53 you to one of the yellow nodes in the 34:54 subtree in this case there's only one 34:56 yellow node which is itself but in a 34:59 more general situation you will arrive 35:00 at a branch node and from there you take 35:03 the index stored in there get to one of 35:05 these fellows and read off the 35:07 occurrence of that substring 35:11 let's try another one ABB B a so you 35:15 start out here branch and character 1 35:17 it's an a we're going to be skipping 35:20 over BBB so I do a match find that 35:24 they're equal then I do a branch and 5 35:27 which is an A ok and you end up here 35:31 when you follow an edge you match the 35:34 skipped over characters here there are 35:36 no skipped over characters to do the 35:39 match so again you kind of terminate in 35:40 the middle 35:41 of an edge and any yellow node down 35:44 there is a matching substring if you do 35:53 a BA BA okay you start out here you 35:57 branch and character one you end up here 35:59 you branch on character two you're going 36:01 to follow this link when you follow a 36:04 link you match on the skipped over 36:05 characters so the B match there the a 36:09 match there the next B matches here but 36:11 the a fails to match that B that tells 36:13 you that there is no occurrence of ba PA 36:22 all right the time required to do the 36:25 match is linear in the length of the 36:26 pattern okay so as you go down you look 36:30 at each character in the pattern once 36:31 and based on that you either find a 36:33 match or you do a branch or if it's a 36:35 non match you stop so you can do 36:41 substring matching in time linear in the 36:45 pattern there is an overhead price that 36:50 you pay in the beginning to set up your 36:52 suffix tree but that's linear in the 36:53 length of the string all right any 37:01 questions about how to use a suffix tree 37:03 to do substring matching 37:15 but suppose you wanted to find all 37:17 occurrences of a pattern not just one so 37:26 we'll follow the search process we just 37:28 had for the pattern and let's suppose 37:34 that the search is successful okay if 37:38 the search terminates at the yellow node 37:42 then that's the only occurrence of that 37:44 pattern on the other hand if the search 37:48 terminates at a branch node all yellow 37:50 nodes in that subtree are occurrences 37:53 and they are the only occurrences so for 37:58 example here if you're searching for a 38:01 for B's and a pound sign so we're going 38:07 to start out here branch and character 1 38:10 compare the rest of this or ABV beefs 38:13 all of match you come here and we look 38:15 at character 5 which is a B you're going 38:17 to follow this you have a match with 38:18 being a pound sign you end up here so 38:23 this tells you that the pattern is a 38:25 substring one occurrence of it is at 38:29 position 5 further there are no other 38:32 occurrences ok so you can get all of the 38:38 occurrences there's only one of them and 38:39 it begins at position 5 look at an 38:45 example 38:46 oh I don't think I have a picture for 38:48 this ok so what we I do ok all right so 38:54 let's suppose ok instead you're 38:57 searching for a B okay so you start out 39:00 here you branch in character 1 you match 39:04 over the skipped characters so the 39:06 search finishes really in the middle on 39:09 this edge but you end up eventually at 39:11 this node ok this is a branch node a 39:14 branch node is guaranteed to have more 39:17 than one yellow node in its subtree 39:19 because this is a compressed try so that 39:21 tells us that there's more than one 39:24 occurrence and in fact the number of 39:27 occurrences is 39:28 exactly equal to the number of yellow 39:29 nodes in the subtree or subtree that's 39:32 routed here okay I can get all of the 39:37 occurrences if I augment the structure 39:40 and the time of construction okay so 39:44 what I do is I'm going to augment it by 39:49 linking together all of the element 39:55 nodes or yellow nodes in in order okay 39:59 and in each branch node I'm going to 40:01 keep a pointer to the first yellow node 40:04 in that subtree and to the last yellow 40:06 node in the subtree okay so if I look at 40:11 my so here first thing I'm going to do 40:13 is I want to link all of these yellow 40:15 nodes together okay and they're going to 40:19 be linked together to form a chain link 40:21 together in in order so create this 40:27 chain of yellow nodes in each branch 40:30 node I'm going to keep a pointer to the 40:32 first and last node in this chain for 40:35 that corresponding subtree so if I look 40:41 at this subtree here the first yellow 40:46 node is number 1 and the second yellow 40:49 node is number 5 if you look at the 40:54 overall tree the first yellow node is 40:56 number 1 and the last yellow node is 40:58 number 10 if you look at this fellow 41:01 here first one is number 4 last one is 41:05 number 9 the first one is number 3 41:08 last one is 8 first one is number 2 last 41:13 one is 7 so each branch node keeps track 41:17 of the first and last nodes on the chain 41:19 for its subtree so then when you reach a 41:23 particular branch node suppose you're 41:25 doing a search and the search terminates 41:27 here and you want to report all 41:29 occurrences of B well then you'd use 41:33 this yellow pointer to get to the first 41:35 occurrence okay which is this fellow and 41:37 then you just follow down the chain 41:38 until you reach the last fellow 41:42 you don't have to do any compares if you 41:45 had to do compares your you'd run in 41:47 quadratic time okay 41:50 so because of these pointers no compares 41:54 aren't needed you know that there's a 41:56 match everywhere you just follow down 41:58 the chain in one unit for occurrence you 42:01 can report the occurrence so you get a 42:07 fast scheme to report all occurrences of 42:09 a pattern in a given string the chaining 42:16 and setting up of these pointers is done 42:18 only once okay you want to find the 42:26 longest repeating substring given a 42:31 string and somebody says well find the 42:33 longest string that occurs at least six 42:35 times now find the longest string that 42:37 occurs at least ten times or two times 42:39 or one so whatever okay sounds like a 42:43 formidable problem but turns out exactly 42:46 very simple when you construct the try 42:50 you leap you label each branch node with 42:53 the number of yellow nodes in it and 43:01 then your task becomes to find a branch 43:08 node with a large enough label okay so 43:12 long as you're looking for an M greater 43:13 than one you have to be at a branch node 43:16 because only the branch nodes represent 43:19 occurrences of things that occur more 43:20 than once so you're looking for a label 43:23 for a branch node with a large enough 43:24 label and the charge number on the 43:31 branch node gives you the length of the 43:36 string that is occurring oh that is 43:39 recurring okay take a look at an example 43:42 here okay see the chart number here is 43:46 five okay and what the five tells you is 43:49 that these yellow nodes here represent 43:52 occurrences of a string of length four 43:54 ABB 43:57 the chart number here is 2 and what that 43:59 tells you is the yellow nodes down here 44:01 represent occurrences of a string of 44:03 length 1 namely B the chart number here 44:08 I think was a 3 yeah so that tells you 44:13 that the yellow nodes here represent 44:15 occurrences of a string of length to be 44:17 B okay the count tells you how many 44:21 times it occurs okay so the chart number 44:25 and that the chart number tells you the 44:27 length of the repeating substring and 44:31 the count tells you how many times it 44:33 occurs okay so if you're looking for C I 44:39 guess these are again the counts okay 44:45 all right so if you're looking for the 44:48 longest substring that occurs at least 44:50 twice okay this is a candidate because 44:55 it says a BB B occurs twice this fellow 45:02 here is a candidate because it says that 45:04 B occurs seven times this guy's a 45:07 candidate it says that BB occurs five 45:09 times this guy is a candidate here it 45:12 says that BB B occurs three times so 45:18 I'll traverse my tree looking at the 45:21 branch nodes 45:21 I see which branch node has a count at 45:25 least equal to what I'm looking for and 45:27 then from amongst those I select the one 45:29 with the largest chart number so that 45:35 gives me the longest occurring string 45:37 longest repeat string that repeats at 45:39 least M times okay you can do that in 45:43 order of n time the traversal of the 45:48 tree is going to take order of n time 45:53 okay finally let's look at the longest 45:57 common substring problem we're given two 46:01 strings s and T and you want to find the 46:04 longest string that's common to s and T 46:07 so for example if s is carport and T is 46:10 airport then the longest substring is 46:14 our ports of length 4 so length 5 it 46:17 occurs in s and it also occurs in T the 46:24 longest common subsequence however is of 46:26 length 5 okay 46:28 so these are two different problems 46:29 longest common subsequence and longest 46:31 common substring the longest common 46:35 subsequence problem you've probably see 46:37 in a dynamic programming solution to 46:38 that and I algorithms course kind of a 46:40 standard problem that studied in dynamic 46:42 programming that runs in time which is 46:46 the product of the lengths of s and T 46:49 but the substring problem can be done in 46:51 time linear in the lengths of s and T by 46:55 constructing a suitable suffix tree so 46:59 what you do is you create a new string s 47:03 put a new symbol in between T and then 47:06 our pound sign and you build the suffix 47:12 tree for this new string U and that's 47:16 going to take you time linear in the 47:20 lengths of s and T so in our case for 47:25 carport and airports I would end up 47:27 building this particular string then you 47:32 make the observation that if you have a 47:36 repeating substring it cannot include 47:38 this fellow okay because this only 47:41 occurs once so repeating substrings have 47:47 the property that either they occur 47:50 somewhere here and again here or they 47:53 occur in airports somewhere on the left 47:56 and again on the right or maybe you got 47:59 an occurrence here and an occurrence 48:00 over there what you're interested in is 48:03 an occurrence here and an occurrence 48:05 over there and the longest such fellow 48:08 okay so we're looking for a longest 48:12 repeating substring with the property 48:15 that one occurrence begins to the left 48:20 of the dollar sign and 48:21 other occurrence begins to the right of 48:23 the dollar sign okay and what that means 48:28 is you set up a suffix tree you do a 48:34 traversal and you label the branch nodes 48:37 with the property that if a branch node 48:40 has a yellow node where one begins is on 48:42 the left of the doll and one on the 48:44 right you say this is a likely candidate 48:46 otherwise you exit and say this is not a 48:48 candidate okay so that way you can 48:50 identify which branch nodes are 48:52 potentials for the solution and then you 48:55 want to find the branch node which is 48:57 identified as a potential with the 48:59 longest charge number on it okay so 49:05 again in linear time you can construct 49:08 the suffix tree and perform a traversal 49:10 on the tree labeling the branch node 49:11 suitably and then identifying the one 49:14 with the longest with the largest chart 49:16 number on it and I will give you the 49:18 longest common substring and the whole 49:21 process takes time linear in the lengths 49:24 of the two strings SMT all right so 49:28 those are just some of the applications 49:31 of suffix trees but hopefully we've seen 49:33 enough to see there is kind of a nice 49:34 data structure to solve problems on 49:37 strings okay we'll stop here