Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Suffix array
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Applications == The suffix array of a string can be used as an [[Index (search engine)|index]] to quickly locate every occurrence of a substring pattern <math>P</math> within the string <math>S</math>. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two [[binary search]]es. The first search locates the starting position of the interval, and the second one determines the end position: {{Citation needed|date=August 2020|reason=Citation needed for the code}} <syntaxhighlight lang="python"> n = len(S) def search(P: str) -> tuple[int, int]: """ Return indices (s, r) such that the interval A[s:r] (including the end index) represents all suffixes of S that start with the pattern P. """ # Find starting position of interval l = 0 # in Python, arrays are indexed starting at 0 r = n while l < r: mid = (l + r) // 2 # division rounding down to nearest integer # suffixAt(A[i]) is the ith smallest suffix if P > suffixAt(A[mid]): l = mid + 1 else: r = mid s = l # Find ending position of interval r = n while l < r: mid = (l + r) // 2 if suffixAt(A[mid]).startswith(P): l = mid + 1 else: r = mid return (s, r) </syntaxhighlight> Finding the substring pattern <math>P</math> of length <math>m</math> in the string <math>S</math> of length <math>n</math> takes <math>\mathcal{O}(m \log n)</math> time, given that a single suffix comparison needs to compare <math>m</math> characters. {{harvtxt|Manber|Myers|1990}} describe how this bound can be improved to <math>\mathcal{O}(m + \log n)</math> time using [[LCP array|LCP]] information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} improve the bound even further and achieve a search time of <math>\mathcal{O}(m)</math> for constant alphabet size, as known from [[suffix tree]]s. Suffix sorting algorithms can be used to compute the [[Burrows–Wheeler transform|Burrows–Wheeler transform (BWT)]]. The [[Burrows–Wheeler transform|BWT]] requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated [[Burrows–Wheeler transform|BWT]] matrix corresponds to the order of suffixes in a suffix array. The [[Burrows–Wheeler transform|BWT]] can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the [[Burrows–Wheeler transform|BWT]] string: <math>BWT[i] = S[A[i]-1]</math>. Suffix arrays can also be used to look up substrings in [[example-based machine translation]], demanding much less storage than a full [[phrase table]] as used in [[Statistical machine translation]]. Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)