Editing Knuth–Morris–Pratt algorithm (section)

==Background==
A string-matching algorithm wants to find the starting index <code>m</code> in string <code>S[]</code> that matches the search word <code>W[]</code>.

The most straightforward algorithm, known as the "[[Brute-force search|brute-force]]" or "naive" algorithm, is to look for a word match at each index <code>m</code>, i.e. the position in the string being searched that corresponds to the character <code>S[m]</code>. At each position <code>m</code> the algorithm first checks for equality of the first character in the word being searched, i.e. <code>S[m] =? W[0]</code>. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, <code>i</code>. The algorithm retrieves the character <code>W[i]</code> in the word being searched and checks for equality of the expression <code>S[m+i] =? W[i]</code>.  If all successive characters match in <code>W</code> at position <code>m</code>, then a match is found at that position in the search string. If the index <code>m</code> reaches the end of the string then there is no match, in which case the search is said to "fail".

Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26 (1 in 26^2 chances of a match over 26 possible letters). So if the characters are random, then the expected complexity of searching string <code>S[]</code> of length ''n'' is on the order of ''n'' comparisons or ''Θ''(''n'').  The expected performance is very good. If <code>S[]</code> is 1 million characters and <code>W[]</code> is 1000 characters, then the string search should complete after about 1.04 million character comparisons.

That expected performance is not guaranteed. If the strings are not random, then checking a trial <code>m</code> may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string <code>S[]</code> consists of 1 million characters that are all ''A'', and that the word <code>W[]</code> is 999 ''A'' characters terminating in a final ''B'' character. The simple string-matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 million positions for 1 billion character comparisons.  If the length of <code>W[]</code> is ''k'', then the worst-case performance is ''O''(''k''&sdot;''n'').

The KMP algorithm has a better worst-case performance than the straightforward algorithm. KMP spends a little time precomputing a table (on the order of the size of <code>W[]</code>, ''O''(''k'')), and then it uses that table to do an efficient search of the string in ''O''(''n'').

The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In the example above, when KMP sees a trial match fail on the 1000th character (<code>i</code> = 999) because <code>S[m+999] ≠ W[999]</code>, it will increment <code>m</code> by 1, but it will know that the first 998 characters at the new position already match.  KMP matched 999 ''A'' characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position <code>m</code> by one throws away the first ''A'', so KMP knows there are 998 ''A'' characters that match <code>W[]</code> and does not retest them; that is, KMP sets <code>i</code> to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase (variable <code>m</code>) and where it will resume testing (variable <code>i</code>).