Editing Suffix array (section)

== Applications ==

The suffix array of a string can be used as an [[Index (search engine)|index]] to quickly locate every occurrence of a substring pattern <math>P</math> within the string <math>S</math>. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two [[binary search]]es. The first search locates the starting position of the interval, and the second one determines the end position:

{{Citation needed|date=August 2020|reason=Citation needed for the code}}

<syntaxhighlight lang="python">
n = len(S)

def search(P: str) -> tuple[int, int]:
    """
    Return indices (s, r) such that the interval A[s:r] (including the end
    index) represents all suffixes of S that start with the pattern P.
    """
    # Find starting position of interval
    l = 0  # in Python, arrays are indexed starting at 0
    r = n
    while l < r:
        mid = (l + r) // 2  # division rounding down to nearest integer
        # suffixAt(A[i]) is the ith smallest suffix
        if P > suffixAt(A[mid]):
            l = mid + 1
        else:
            r = mid
    s = l
    
    # Find ending position of interval
    r = n
    while l < r:
        mid = (l + r) // 2
        if suffixAt(A[mid]).startswith(P):
            l = mid + 1
        else:
            r = mid
    return (s, r)
</syntaxhighlight>
Finding the substring pattern <math>P</math> of length <math>m</math> in the string <math>S</math> of length <math>n</math> takes <math>\mathcal{O}(m \log n)</math> time, given that a single suffix comparison needs to compare <math>m</math> characters. {{harvtxt|Manber|Myers|1990}} describe how this bound can be improved to <math>\mathcal{O}(m + \log n)</math> time using [[LCP array|LCP]] information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} improve the bound even further and achieve a search time of <math>\mathcal{O}(m)</math> for constant alphabet size, as known from [[suffix tree]]s.

Suffix sorting algorithms can be used to compute the [[Burrows–Wheeler transform|Burrows–Wheeler transform (BWT)]]. The [[Burrows–Wheeler transform|BWT]] requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated [[Burrows–Wheeler transform|BWT]] matrix corresponds to the order of suffixes in a suffix array. The [[Burrows–Wheeler transform|BWT]] can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the [[Burrows–Wheeler transform|BWT]] string: <math>BWT[i] = S[A[i]-1]</math>.

Suffix arrays can also be used to look up substrings in [[example-based machine translation]], demanding much less storage than a full [[phrase table]] as used in [[Statistical machine translation]].

Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.