Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Knuth–Morris–Pratt algorithm
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=="Partial match" table (also known as "failure function")== The goal of the table is to allow the algorithm not to match any character of <code>S</code> more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an ''initial segment'' of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so. We want to be able to look up, for each position in <code>W</code>, the length of the longest possible initial segment of <code>W</code> leading up to (but not including) that position, other than the full segment starting at <code>W[0]</code> that just failed to match; this is how far we have to backtrack in finding the next match. Hence <code>T[i]</code> is exactly the length of the longest possible ''proper'' initial segment of <code>W</code> which is also a segment of the substring ending at <code>W[i - 1]</code>. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set <code>T[0] = -1</code>, as discussed [[#Description of pseudocode for the table-building algorithm|below]]. ===Working example of the table-building algorithm=== We consider the example of <code>W = "ABCDABD"</code> first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <code>T[0] = -1</code>. To find <code>T[1]</code>, we must discover a [[Substring#Suffix|proper suffix]] of <code>"A"</code> which is also a prefix of pattern <code>W</code>. But there are no proper suffixes of <code>"A"</code>, so we set <code>T[1] = 0</code>. To find <code>T[2]</code>, we see that the substring <code>W[0]</code> - <code>W[1]</code> (<code>"AB"</code>) has a proper suffix <code>"B"</code>. However "B" is not a prefix of the pattern <code>W</code>. Therefore, we set <code>T[2] = 0</code>. Continuing to <code>T[3]</code>, we first check the proper suffix of length 1, and as in the previous case it fails. Should we also check longer suffixes? No, we now note that there is a shortcut to checking ''all'' suffixes: let us say that we discovered a [[Substring#Suffix|proper suffix]] which is a [[Substring#Prefix|proper prefix]] (a proper prefix of a string is not equal to the string itself) and ending at <code>W[2]</code> with length 2 (the maximum possible); then its first character is also a proper prefix of <code>W</code>, hence a proper prefix itself, and it ends at <code>W[1]</code>, which we already determined did not occur as <code>T[2] = 0</code> and not <code>T[2] = 1</code>. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (i.e. <code>T[x] = m</code>) and should not bother to check m+2, m+3, etc. Therefore, we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <code>T[3] = 0</code>. We pass to the subsequent <code>W[4]</code>, <code>'A'</code>. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since "D" is not a prefix of <code>W</code>. But instead of setting <code>T[4] = 0</code>, we can do better by noting that <code>W[4] = W[0]</code>, and also that a look-up of <code>T[4]</code> implies the corresponding <code>S</code> character, <code>S[m+4]</code>, was a mismatch and therefore <code>S[m+4] ≠ 'A'</code>. Thus there is no point in restarting the search at <code>S[m+4]</code>; we should begin at 1 position ahead. This means that we may shift pattern <code>W</code> by match length plus one character, so <code>T[4] = -1</code>. Considering now the next character, <code>W[5]</code>, which is <code>'B'</code>: though by inspection the longest substring would appear to be <code>'A'</code>, we still set <code>T[5] = 0</code>. The reasoning is similar to why <code>T[4] = -1</code>. <code>W[5]</code> itself extends the prefix match begun with <code>W[4]</code>, and we can assume that the corresponding character in <code>S</code>, <code>S[m+5] ≠ 'B'</code>. So backtracking before <code>W[5]</code> is pointless, but <code>S[m+5]</code> may be <code>'A'</code>, hence <code>T[5] = 0</code>. Finally, we see that the next character in the ongoing segment starting at <code>W[4] = 'A'</code> would be <code>'B'</code>, and indeed this is also <code>W[5]</code>. Furthermore, the same argument as above shows that we need not look before <code>W[4]</code> to find a segment for <code>W[6]</code>, so that this is it, and we take <code>T[6] = 2</code>. Therefore, we compile the following table: {| class="wikitable" style="background-color:white; font-family:monospace; text-align:right" !<code>i</code> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |- !<code>W[i]</code> | A | B | C | D | A | B | D | |- !<code>T[i]</code> | -1 | 0 | 0 | 0 | -1 | 0 | 2 | 0 |} Another example: {| class="wikitable" style="background-color:white; font-family:monospace; text-align:right" !<code>i</code> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |- !<code>W[i]</code> | A | B | A | C | A | B | A | B | C | |- !<code>T[i]</code> | -1 | 0 | -1 | 1 | -1 | 0 | -1 | 3 | 2 | 0 |} Another example (slightly changed from the previous example): {| class="wikitable" style="background-color:white; font-family:monospace; text-align:right" !<code>i</code> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |- !<code>W[i]</code> | A | B | A | C | A | B | A | B | A | |- !<code>T[i]</code> | -1 | 0 | -1 | 1 | -1 | 0 | -1 | 3 | -1 | 3 |} Another more complicated example: {| class="wikitable" style="background-color:white; font-family:monospace; text-align:right" !<code>i</code> | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |- !<code>W[i]</code> | P | A | R | T | I | C | I | P | A | T | E | | I | N | | P | A | R | A | C | H | U | T | E | |- !<code>T[i]</code> | -1 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |} ===Description of pseudocode for the table-building algorithm=== <!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code. Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail. This algorithm is so simple that neither of these can ever be the case--> <!--This code appears to implement the table for the Morris-Pratt algorithm, not for the Knuth-Morris-Pratt Algorithm. The KMP table has bigger shifts. See http://www-igm.univ-mlv.fr/~lecroq/string/node8.html--> The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code. '''algorithm''' ''kmp_table'': '''input''': an array of characters, W (the word to be analyzed) '''output''': an array of integers, T (the table to be filled) '''define variables''': an integer, pos ← 1 (the current position we are computing in T) an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring) '''let''' T[0] ← -1 '''while''' pos < length(W) '''do''' '''if''' W[pos] = W[cnd] '''then''' '''let''' T[pos] ← T[cnd] '''else''' '''let''' T[pos] ← cnd '''while''' cnd ≥ 0 '''and''' W[pos] ≠ W[cnd] '''do''' '''let''' cnd ← T[cnd] '''let''' pos ← pos + 1, cnd ← cnd + 1 '''let''' T[pos] ← cnd (only needed when all word occurrences are searched) ===Efficiency of the table-building algorithm=== The time (and space) complexity of the table algorithm is <math>O(k)</math>, where <math>k</math> is the length of <code>W</code>. * The outer loop: <code>pos</code> is initialized to 1, the loop condition is <code>pos < k</code>, and <code>pos</code> is increased by 1 in every iteration of the loop. Thus the loop will take <math>k - 1</math> iterations. * The inner loop: <code>cnd</code> is initialized to <code>0</code> and gets increased by at most 1 in each outer loop iteration. <code>T[cnd]</code> is always less than <code>cnd</code>, so <code>cnd</code> gets decreased by at least 1 in each inner loop iteration; the inner loop condition is <code>cnd ≥ 0</code>. This means that the inner loop can execute at most as many times in total, as the outer loop has executed – each decrease of <code>cnd</code> by 1 in the inner loop needs to have a corresponding increase by 1 in the outer loop. Since the outer loop takes <math>k - 1</math> iterations, the inner loop can take no more than <math>k - 1</math> iterations in total. Combined, the outer and inner loops take at most <math>2k-2</math> iterations. This corresponds to <math>O(k)</math> time complexity using the [[Big O notation]].
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)