Editing Knuth–Morris–Pratt algorithm (section)

==KMP algorithm==

===Example of the search algorithm===
To illustrate the algorithm's details, consider a (relatively artificial) run of the algorithm, where <code>W</code> = "ABCDABD" and <code>S</code> = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:

* <code>m</code>, denoting the position within <code>S</code> where the prospective match for <code>W</code> begins,
* <code>i</code>, denoting the index of the currently considered character in <code>W</code>.

In each step the algorithm compares <code>S[m+i]</code> with <code>W[i]</code> and increments <code>i</code> if they are equal. This is depicted, at the start of the run, like

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: {{color|blue|ABC}}{{color|red| }}{{color|gray|ABCDAB ABCDABCDABDE}}
 W: {{color|blue|ABC}}{{color|red|D}}{{color|gray|ABD}}
 i: {{color|blue|012}}{{color|red|3}}{{color|gray|456}}
</div>

The algorithm compares successive characters of <code>W</code> to "parallel" characters of <code>S</code>, moving from one to the next by incrementing <code>i</code> if they match.  However, in the fourth step <code>S[3]&nbsp;=&nbsp;'&nbsp;'</code> does not match <code>W[3] = 'D'</code>.  Rather than beginning to search again at <code>S[1]</code>, we note that no <code>'A'</code> occurs between positions 1 and 2 in <code>S</code>; hence, having checked all those characters previously (and knowing they matched the corresponding characters in <code>W</code>), there is no chance of finding the beginning of a match.  Therefore, the algorithm sets <code>m = 3</code> and <code>i = 0</code>.

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: ABC{{color|red| }}{{color|gray|ABCDAB ABCDABCDABDE}}
 W:    {{color|red|A}}{{color|gray|BCDABD}}
 i:    {{color|red|0}}{{color|gray|123456}}
</div>

This match fails at the initial character, so the algorithm sets <code>m = 4</code> and <code>i = 0</code>

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: ABC {{color|blue|ABCDAB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
 W:     {{color|blue|ABCDAB}}{{color|red|D}}
 i:     {{color|blue|012345}}{{color|red|6}}
</div>

Here, <code>i</code> increments through a nearly complete match <code>"ABCDAB"</code> until <code>i = 6</code> giving a mismatch at <code>W[6]</code> and <code>S[10]</code>.  However, just prior to the end of the current partial match, there was that substring <code>"AB"</code> that could be the beginning of a new match, so the algorithm must take this into consideration.  As these characters match the two characters prior to the current position, those characters need not be checked again; the algorithm sets <code>m = 8</code> (the start of the initial prefix) and <code>i = 2</code> (signaling the first two characters match) and continues matching.  Thus the algorithm not only omits previously matched characters of <code>S</code> (the <code>"AB"</code>), but also previously matched characters of <code>W</code> (the prefix <code>"AB"</code>).

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: ABC ABCD{{color|gray|AB}}{{color|red| }}{{color|gray|ABCDABCDABDE}}
 W:         {{color|gray|AB}}{{color|red|C}}{{color|gray|DABD}}
 i:         {{color|gray|01}}{{color|red|2}}{{color|gray|3456}}
</div>

This search at the new position fails immediately because <code>W[2]</code> (a <code>'C'</code>) does not match <code>S[10]</code> (a <code>'&nbsp;'</code>). As in the first trial, the mismatch causes the algorithm to return to the beginning of <code>W</code> and begins searching at the mismatched character position of <code>S</code>: <code>m = 10</code>, reset <code>i = 0</code>.

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: ABC ABCDAB{{color|red| }}{{color|gray|ABCDABCDABDE}}
 W:           {{color|red|A}}{{color|gray|BCDABD}}
 i:           {{color|red|0}}{{color|gray|123456}}
</div>

The match at <code>m=10</code> fails immediately, so the algorithm next tries <code>m = 11</code> and <code>i = 0</code>.

<div style="font-family:monospace;"> 
              1         2  
 m: 01234567890123456789012
 S: ABC ABCDAB {{color|blue|ABCDAB}}{{color|red|C}}{{color|gray|DABDE}}
 W:            {{color|blue|ABCDAB}}{{color|red|D}}
 i:            {{color|blue|012345}}{{color|red|6}}
</div>

Once again, the algorithm matches <code>"ABCDAB"</code>, but the next character, <code>'C'</code>, does not match the final character <code>'D'</code> of the word <code>W</code>.  Reasoning as before, the algorithm sets <code>m = 15</code>, to start at the two-character string <code>"AB"</code> leading up to the current position, set <code>i = 2</code>, and continue matching from the current position.

<div style="font-family:monospace;">
              1         2  
 m: 01234567890123456789012
 S: ABC ABCDAB ABCD{{color|blue|ABCDABD}}{{color|gray|E}}
 W:                {{color|blue|ABCDABD}}
 i:                {{color|blue|0123456}}
</div>

This time the match is complete, and the first character of the match is <code>S[15]</code>.

===Description of pseudocode for the search algorithm===
<!--Note to future editors: please do not replace or even supplement the pseudocode with language-specific code.  Following the WikiProject Computer Science manual of style, pseudocode is preferred over real code unless the real code illustrates a feature of the language or an implementation detail.  This algorithm is so simple that neither of these can ever be the case-->
The above example contains all the elements of the algorithm.  For the moment, we assume the existence of a "partial match" table <code>T</code>, described [[#"Partial match" table (also known as "failure function")|below]], which indicates where we need to look for the start of a new match when a mismatch is found.  The entries of <code>T</code> are constructed so that if we have a match starting at <code>S[m]</code> that fails when comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match will start at index <code>m + i - T[i]</code> in <code>S</code> (that is, <code>T[i]</code> is the amount of "backtracking" we need to do after a mismatch).  This has two implications: first, <code>T[0] = -1</code>, which indicates that if <code>W[0]</code> is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will ''begin'' at index <code>m + i - T[i]</code>, as in the example above, we need not actually check any of the <code>T[i]</code> characters after that, so that we continue searching from <code>W[T[i]]</code>.  The following is a sample [[pseudocode]] implementation of the KMP search algorithm.
 '''algorithm''' ''kmp_search'':
     '''input''':
         an array of characters, S (the text to be searched)
         an array of characters, W (the word sought)
     '''output''':
         an array of integers, P (positions in S at which W is found)
         an integer, nP (number of positions)
 
     '''define variables''':
         an integer, j ← 0 (the position of the current character in S)
         an integer, k ← 0 (the position of the current character in W)
         an array of integers, T (the table, computed elsewhere)
 
     '''let''' nP ← 0
 
     '''while''' j < length(S) '''do'''
         '''if''' W[k] = S[j] '''then'''
             '''let''' j ← j + 1
             '''let''' k ← k + 1
             '''if''' k = length(W) '''then'''
                 (occurrence found, if only first occurrence is needed, m ← j - k  may be returned here)
                 '''let''' P[nP] ← j - k, nP ← nP + 1
                 '''let''' k ← T[k] (T[length(W)] can't be -1)
         '''else'''
             '''let''' k ← T[k]
             '''if''' k < 0 '''then'''
                 '''let''' j ← j + 1
                 '''let''' k ← k + 1

===Efficiency of the search algorithm===
Assuming the prior existence of the table <code>T</code>, the search portion of the Knuth–Morris–Pratt algorithm has [[Computational complexity theory|complexity]] [[Linear time#Linear time|''O''(''n'')]], where ''n'' is the length of <code>S</code> and the ''O'' is [[big-O notation]].  Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the <code>'''while'''</code> loop. To bound the number of iterations of this loop; observe that <code>T</code> is constructed so that if a match which had begun at <code>S[m]</code> fails while comparing <code>S[m + i]</code> to <code>W[i]</code>, then the next possible match must begin at <code>S[m + (i - T[i])]</code>.  In particular, the next possible match must occur at a higher index than <code>m</code>, so that <code>T[i] < i</code>.

This fact implies that the loop can execute at most 2''n'' times, since at each iteration it executes one of the two branches in the loop.  The first branch invariably increases <code>i</code> and does not change <code>m</code>, so that the index <code>m + i</code> of the currently scrutinized character of <code>S</code> is increased.  The second branch adds <code>i - T[i]</code> to <code>m</code>, and as we have seen, this is always a positive number.  Thus the location <code>m</code> of the beginning of the current potential match is increased. At the same time, the second branch leaves <code>m + i</code> unchanged, for <code>m</code> gets <code>i - T[i]</code> added to it, and immediately after <code>T[i]</code> gets assigned as the new value of <code>i</code>, hence <code>new_m + new_i =  old_m + old_i - T[old_i] + T[old_i] = old_m + old_i</code>. Now, the loop ends if <code>m + i</code> = ''n''; therefore, each branch of the loop can be reached at most ''n'' times, since they respectively increase either <code>m + i</code> or <code>m</code>, and <code>m ≤ m + i</code>: if <code>m</code> = ''n'', then certainly <code>m + i</code> ≥ ''n'', so that since it increases by unit increments at most, we must have had <code>m + i</code> = ''n'' at some point in the past, and therefore either way we would be done.

Thus the loop executes at most 2''n'' times, showing that the time complexity of the search algorithm is ''O''(''n'').

Here is another way to think about the runtime:
Let us say we begin to match <code>W</code> and <code>S</code> at position <code>i</code> and <code>p</code>. If <code>W</code> exists as a substring of <code>S</code> at p, then <code>W[0..m] = S[p..p+m]</code>.
Upon success, that is, the word and the text matched at the positions (<code>W[i] = S[p+i]</code>), we increase <code>i</code> by 1.
Upon failure, that is, the word and the text do not match at the positions (<code>W[i] ≠ S[p+i]</code>), the text pointer is kept still, while the word pointer is rolled back a certain amount (<code>i = T[i]</code>, where <code>T</code> is the jump table), and we attempt to match <code>W[T[i]]</code> with <code>S[p+i]</code>.
The maximum number of roll-back of <code>i</code> is bounded by <code>i</code>, that is to say, for any failure, we can only roll back as much as we have progressed up to the failure.
Then it is clear the runtime is 2''n''.