Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Suffix array
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Construction algorithms == A suffix tree can be built in <math>\mathcal{O}(n)</math> and can be converted into a suffix array by traversing the tree depth-first also in <math>\mathcal{O}(n)</math>, so there exist algorithms that can build a suffix array in <math>\mathcal{O}(n)</math>. A naive approach to construct a suffix array is to use a [[Comparison sort|comparison-based sorting algorithm]]. These algorithms require <math>\mathcal{O}(n \log n)</math> suffix comparisons, but a suffix comparison runs in <math>\mathcal{O}(n)</math> time, so the overall runtime of this approach is <math>\mathcal{O}(n^2 \log n)</math>. More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:{{sfn|Puglisi|Smyth|Turpin|2007}} * minimal asymptotic complexity <math>\Theta(n)</math> * lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed * fast in practice One of the first algorithms to achieve all goals is the SA-IS algorithm of {{harvtxt|Nong|Zhang|Chan|2009}}. The algorithm is also rather simple (< 100 [[Source lines of code|LOC]]) and can be enhanced to simultaneously construct the [[LCP array]].{{sfn|Fischer|2011}} The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori<ref>{{Cite web |last=Mori |first=Yuta |title=sais |url=https://sites.google.com/site/yuta256/sais |url-status=dead |archive-url=https://web.archive.org/web/20230309123010/https://sites.google.com/site/yuta256/sais |archive-date=9 Mar 2023 |access-date=31 Aug 2023}}</ref> outperforms most other linear or super-linear construction approaches. Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported [[Alphabet (computer science)|alphabet]]: ''constant alphabets'' where the alphabet size is bound by a constant, ''integer alphabets'' where characters are integers in a range depending on <math>n</math> and ''general alphabets'' where only character comparisons are allowed.{{sfn|Burkhardt|Kärkkäinen|2003}} Most suffix array construction algorithms are based on one of the following approaches:{{sfn|Puglisi|Smyth|Turpin|2007}} * ''Prefix doubling'' algorithms are based on a strategy of {{harvtxt|Karp|Miller|Rosenberg|1972}}. The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix. * ''Recursive'' algorithms follow the approach of the suffix tree construction algorithm by {{harvtxt|Farach|1997}} to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array. * ''Induced copying'' algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by {{harvtxt|Puglisi|Smyth|Turpin|2007}}. A well-known recursive algorithm for integer alphabets is the ''DC3 / skew'' algorithm of {{harvtxt|Kärkkäinen|Sanders|2003}}. It runs in linear time and has successfully been used as the basis for parallel{{sfn|Kulla|Sanders|2007}} and [[External memory algorithm|external memory]]{{sfn|Dementiev|Kärkkäinen|Mehnert|Sanders|2008}} suffix array construction algorithms. Recent work by {{harvtxt|Salson|Lecroq|Léonard|Mouchard|2010}} proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is <math>\mathcal{O}(n \log n)</math>, it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text. In practical [[open source]] work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.<ref>{{cite journal |last1=Larsson |first1=N. Jesper |last2=Sadakane |first2=Kunihiko |title=Faster suffix sorting |journal=Theoretical Computer Science |date=22 November 2007 |volume=387 |issue=3 |pages=258–272 |doi=10.1016/j.tcs.2007.07.017 |language=en |issn=0304-3975|doi-access=free }}</ref> This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.<ref>{{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}</ref> In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov <ref>{{Cite web|title=New saca and bwt library (libsais)|url=https://encode.su/threads/3579-New-saca-and-bwt-library-(libsais)|access-date=2021-10-03|website=encode.su}}</ref> which in average showed 65% performance improvement over DivSufSort implementation on the [[Silesia corpus]].<ref>{{Citation|last=Grebnov|first=Ilya|title=libsais|date=2021-09-22|url=https://github.com/IlyaGrebnov/libsais|access-date=2021-10-02}}</ref>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)