Burrows–Wheeler transform
{{Short description|Algorithm used in data compression}} {{Infobox algorithm | name = Burrows–Wheeler transform | class = preprocessing for lossless compression | time = ''O''(''n'') | space = ''O''(''n'') | data = string }} The '''Burrows–Wheeler transform''' ('''BWT''') rearranges a [[character string (computer science)|character string]] into runs of similar characters, in a manner that can be reversed to recover the original string. Since [[data compression|compression]] techniques such as [[move-to-front transform]] and [[run-length encoding]] are more effective when such runs are present, the BWT can be used as a preparatory step to improve the efficiency of a compression algorithm, and is used this way in software such as [[bzip2]]. The algorithm can be implemented efficiently using a [[suffix array]] thus reaching linear time complexity. It was invented by [[David Wheeler (computer scientist)|David Wheeler]] in 1983, and later published by him and [[Michael Burrows (computer scientist)|Michael Burrows]] in 1994. Their paper included a compression algorithm, called the '''Block-sorting Lossless Data Compression Algorithm''' or '''BSLDCA''', that compresses data by using the BWT followed by move-to-front coding and [[Huffman coding]] or [[arithmetic coding]].<ref name=Burrows1994>{{citation | first1 = Michael | last1 = Burrows | author-link1= Michael Burrows (computer scientist) | first2 = David J. | last2 = Wheeler | author-link2= David Wheeler (British computer scientist) | title=A block sorting lossless data compression algorithm | publisher=Technical Report 124, Digital Equipment Corporation | date=May 10, 1994 | url=https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html | url-status=dead | archiveurl=http://web.archive.org/web/20030105080431/https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html | archivedate=January 5, 2003 }}</ref><ref name="u822">{{cite conference | last=Arnavut | first=Z. | last2=Magliveras | first2=S.S. 
| title=Block sorting and compression | publisher=IEEE Comput. Soc. Press | date=1997 | isbn=978-0-8186-7761-8 | doi=10.1109/DCC.1997.582009 | url=https://ieeexplore.ieee.org/document/582009/ | access-date=2025-05-07 | pages=181–190}}</ref>

==Description==
The transform is computed by constructing a matrix (known as the Burrows–Wheeler matrix<ref name="langmead">{{cite web |last1=Langmead |first1=Ben |title=Burrows-Wheeler Transform and FM Index |url=https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf |website=Johns Hopkins Whiting School of Engineering |access-date=23 April 2025}}</ref>) whose rows are the [[circular shift]]s of the input text, [[sorted]] in [[lexicographic order]], and then taking the final column of that matrix.

To allow the transform to be reversed, one additional step is necessary: either the index of the original string in the Burrows–Wheeler matrix must be returned along with the transformed string (the approach shown in the original paper by Burrows and Wheeler<ref name=Burrows1994/>), or a special end-of-text character must be added at the start or end of the input text before the transform is executed.<ref name="langmead"/>

===Example===
Given an input string <code>S = <span style="color:red;">^</span>BANANA<span style="color:red;">$</span></code> (step 1 in the table below), rotate it ''N'' times (step 2), where <code>N = 8</code> is the length of <code>S</code>, counting both the red <code><span style="color:red;">^</span></code> character that marks the start of the string and the red <code><span style="color:red;">$</span></code> character that represents the '[[End-of-file|EOF]]' marker. These rotations, or circular shifts, are then sorted lexicographically (step 3); in this example the two markers sort after the letters, with <code><span style="color:red;">^</span></code> preceding <code><span style="color:red;">$</span></code>.
The output of the encoding phase is the last column <code>L = BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A</code> after step 3, and the index (0-based) <code>I</code> of the row containing the original string <code>S</code>, in this case <code>I = 6</code>. It is not necessary to use both <code><span style="color:red;">$</span></code> and <code><span style="color:red;">^</span></code>, but at least one must be used, else we cannot invert the transform, since all circular permutations of a string have the same Burrows–Wheeler transform. {| class="wikitable" |- ! colspan="5" | Transformation |- ! 1. Input ! 2. All<br />rotations ! 3. Sort into<br />lexical order ! 4. Take the<br />last column ! 5. Output |- | align=center | <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> | <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> <span style="color:red;">$</span><span style="color:red;">^</span>BANANA A<span style="color:red;">$</span><span style="color:red;">^</span>BANAN NA<span style="color:red;">$</span><span style="color:red;">^</span>BANA ANA<span style="color:red;">$</span><span style="color:red;">^</span>BAN NANA<span style="color:red;">$</span><span style="color:red;">^</span>BA ANANA<span style="color:red;">$</span><span style="color:red;">^</span>B BANANA<span style="color:red;">$</span><span style="color:red;">^</span> | '''A'''NANA<span style="color:red;">$</span><span style="color:red;">^</span>B '''A'''NA<span style="color:red;">$</span><span style="color:red;">^</span>BAN '''A'''<span style="color:red;">$</span><span style="color:red;">^</span>BANAN '''B'''ANANA<span style="color:red;">$</span><span style="color:red;">^</span> '''N'''ANA<span style="color:red;">$</span><span style="color:red;">^</span>BA '''N'''A<span style="color:red;">$</span><span style="color:red;">^</span>BANA <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> <span 
style="color:red;">'''$^'''</span>BANANA | ANANA<span style="color:red;">$</span><span style="color:red;">^</span>'''B''' ANA<span style="color:red;">$</span><span style="color:red;">^</span>BA'''N''' A<span style="color:red;">$</span><span style="color:red;">^</span>BANA'''N''' BANANA<span style="color:red;">$</span><span style="color:red;">'''^'''</span> NANA<span style="color:red;">$</span><span style="color:red;">^</span>B'''A''' NA<span style="color:red;">$</span><span style="color:red;">^</span>BAN'''A''' <span style="color:red;">^</span>BANANA<span style="color:red;">'''$'''</span> <span style="color:red;">$^</span>BANAN'''A''' | BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A |}

===Pseudocode===
The following [[pseudocode]] gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string <code>s</code> contains a special character 'EOF' which is the last character and occurs nowhere else in the text.

 '''function''' BWT (''string'' s)
     create a table, where the rows are all possible rotations of s
     sort rows alphabetically
     '''return''' (last column of the table)

 '''function''' inverseBWT (''string'' s)
     create empty table
     '''repeat''' length(s) '''times'''
         // first insert creates first column
         insert s as a column of table before first column of the table
         sort rows of the table alphabetically
     '''return''' (row that ends with the 'EOF' character)

==Explanation==
If the original string had several substrings that occurred often, then the BWT-transformed string will have several places where a single character is repeated many times in a row,<ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref> creating more-easily-compressible data.
For instance, consider transforming an English text that frequently contains the word "the":
{| class="wikitable"
! Input
| <code>THE.MAN.AND.THE.DOG.WAITED.AT.THE.STATION.FOR.THE.TRAIN.TO.THE.CITY</code>
|-
! Output
| <code>NDEENEEODTRNEGRWM..T.EN.HHHHHT.OTTTTTATAC.AOIATDIFOT.ASI..Y..A..I.T</code>
|}
Sorting the rotations of this text groups the rotations starting with "HE." together, and the last character of each such rotation (which is also the character preceding "HE." in the text) is usually "T" (though occasionally not, for example if the text contained "ACHE."), so the output contains a run of consecutive "T" characters, here five of them. Similarly, rotations beginning with "E." are grouped together, and "E." is most often preceded by "H", so the output above contains a run of five consecutive "H" characters.

The success of the transform thus depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (at least a few kilobytes) of appropriate data, such as text.

The remarkable thing about the BWT is not that it generates a more easily encoded output (an ordinary sort would do that) but that it does so ''reversibly'', allowing the original document to be regenerated from the last column alone.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, the first column can easily be reconstructed: the last column contains every character in the text, so sorting these characters alphabetically gives the first column. Then, the last and first columns (of each row) together give all ''pairs'' of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first ''and second'' columns.
To obtain the third column, the last column is again prepended to the table, and the rows are sorted lexicographically. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this: {| class="wikitable" ! colspan=4 | Inverse transformation |- ! colspan=4 | Input |- | align=center colspan=4 | BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A |- ! Add 1 ! Sort 1 ! Add 2 ! Sort 2 |- | align=right | B N N <span style="color:red;">^</span> A A <span style="color:red;">$</span> A | align=right | A A A B N N <span style="color:red;">^</span> <span style="color:red;">$</span> | align=right | BA NA NA <span style="color:red;">^</span>B AN AN <span style="color:red;">$</span><span style="color:red;">^</span> A<span style="color:red;">$</span> | align=right | AN AN A<span style="color:red;">$</span> BA NA NA <span style="color:red;">^</span>B <span style="color:red;">$</span><span style="color:red;">^</span> |- ! Add 3 ! Sort 3 ! Add 4 ! 
Sort 4 |- | align=right | BAN NAN NA<span style="color:red;">$</span> <span style="color:red;">^</span>BA ANA ANA <span style="color:red;">$</span><span style="color:red;">^</span>B A<span style="color:red;">$</span><span style="color:red;">^</span> | align=right | ANA ANA A<span style="color:red;">$</span><span style="color:red;">^</span> BAN NAN NA<span style="color:red;">$</span> <span style="color:red;">^</span>BA <span style="color:red;">$</span><span style="color:red;">^</span>B | align=right | BANA NANA NA<span style="color:red;">$</span><span style="color:red;">^</span> <span style="color:red;">^</span>BAN ANAN ANA<span style="color:red;">$</span> <span style="color:red;">$</span><span style="color:red;">^</span>BA A<span style="color:red;">$</span><span style="color:red;">^</span>B | align=right | ANAN ANA<span style="color:red;">$</span> A<span style="color:red;">$</span><span style="color:red;">^</span>B BANA NANA NA<span style="color:red;">$</span><span style="color:red;">^</span> <span style="color:red;">^</span>BAN <span style="color:red;">$</span><span style="color:red;">^</span>BA |- ! Add 5 ! Sort 5 ! Add 6 ! 
Sort 6 |- | align=right | BANAN NANA<span style="color:red;">$</span> NA<span style="color:red;">$</span><span style="color:red;">^</span>B <span style="color:red;">^</span>BANA ANANA ANA<span style="color:red;">$</span><span style="color:red;">^</span> <span style="color:red;">$</span><span style="color:red;">^</span>BAN A<span style="color:red;">$</span><span style="color:red;">^</span>BA | align=right | ANANA ANA<span style="color:red;">$</span><span style="color:red;">^</span> A<span style="color:red;">$</span><span style="color:red;">^</span>BA BANAN NANA<span style="color:red;">$</span> NA<span style="color:red;">$</span><span style="color:red;">^</span>B <span style="color:red;">^</span>BANA <span style="color:red;">$</span><span style="color:red;">^</span>BAN | align=right | BANANA NANA<span style="color:red;">$</span><span style="color:red;">^</span> NA<span style="color:red;">$</span><span style="color:red;">^</span>BA <span style="color:red;">^</span>BANAN ANANA<span style="color:red;">$</span> ANA<span style="color:red;">$</span><span style="color:red;">^</span>B <span style="color:red;">$</span><span style="color:red;">^</span>BANA A<span style="color:red;">$</span><span style="color:red;">^</span>BAN | align=right | ANANA<span style="color:red;">$</span> ANA<span style="color:red;">$</span><span style="color:red;">^</span>B A<span style="color:red;">$</span><span style="color:red;">^</span>BAN BANANA NANA<span style="color:red;">$</span><span style="color:red;">^</span> NA<span style="color:red;">$</span><span style="color:red;">^</span>BA <span style="color:red;">^</span>BANAN <span style="color:red;">$</span><span style="color:red;">^</span>BANA |- ! Add 7 ! Sort 7 ! Add 8 ! 
Sort 8 |- | align=right | BANANA<span style="color:red;">$</span> NANA<span style="color:red;">$</span><span style="color:red;">^</span>B NA<span style="color:red;">$</span><span style="color:red;">^</span>BAN <span style="color:red;">^</span>BANANA ANANA<span style="color:red;">$</span><span style="color:red;">^</span> ANA<span style="color:red;">$</span><span style="color:red;">^</span>BA <span style="color:red;">$</span><span style="color:red;">^</span>BANAN A<span style="color:red;">$</span><span style="color:red;">^</span>BANA | align=right | ANANA<span style="color:red;">$</span><span style="color:red;">^</span> ANA<span style="color:red;">$</span><span style="color:red;">^</span>BA A<span style="color:red;">$</span><span style="color:red;">^</span>BANA BANANA<span style="color:red;">$</span> NANA<span style="color:red;">$</span><span style="color:red;">^</span>B NA<span style="color:red;">$</span><span style="color:red;">^</span>BAN <span style="color:red;">^</span>BANANA <span style="color:red;">$</span><span style="color:red;">^</span>BANAN | align=right | BANANA<span style="color:red;">$</span><span style="color:red;">^</span> NANA<span style="color:red;">$</span><span style="color:red;">^</span>BA NA<span style="color:red;">$</span><span style="color:red;">^</span>BANA <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> ANANA<span style="color:red;">$</span><span style="color:red;">^</span>B ANA<span style="color:red;">$</span><span style="color:red;">^</span>BAN <span style="color:red;">$</span><span style="color:red;">^</span>BANANA A<span style="color:red;">$</span><span style="color:red;">^</span>BANAN | align=right | ANANA<span style="color:red;">$</span><span style="color:red;">^</span>B ANA<span style="color:red;">$</span><span style="color:red;">^</span>BAN A<span style="color:red;">$</span><span style="color:red;">^</span>BANAN BANANA<span style="color:red;">$</span><span style="color:red;">^</span> NANA<span 
style="color:red;">$</span><span style="color:red;">^</span>BA NA<span style="color:red;">$</span><span style="color:red;">^</span>BANA <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> <span style="color:red;">$</span><span style="color:red;">^</span>BANANA |- ! colspan=4 | Output |- | align=center colspan=4 | <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> |}

==Optimization==
A number of [[Optimization (computer science)|optimizations]] can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the string, and the sort performed using the indices. In the decoder, there is also no need to store the table, and the decoded string can be generated one character at a time from left to right. Comparison sorting can even be avoided in favor of linear sorting, with performance proportional to the alphabet size and string length. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.

Mathematically, the encoded string can be computed as a simple modification of the [[suffix array]], and suffix arrays can be computed in linear time and memory. The BWT can be defined with regard to the suffix array SA of text T as follows (0-based indexing, with <code>$</code> as the last character of T): <math display=block>BWT[i] = \begin{cases} T[SA[i]-1], & \text{if }SA[i] > 0\\ \$, & \text{otherwise}\end{cases}</math><ref>{{Cite journal|last1=Simpson|first1=Jared T.|last2=Durbin|first2=Richard|date=2010-06-15|title=Efficient construction of an assembly string graph using the FM-index|journal=Bioinformatics|language=en|volume=26|issue=12|pages=i367–i373|doi=10.1093/bioinformatics/btq217|issn=1367-4803|pmc=2881401|pmid=20529929}}</ref>

There is no need to have an actual 'EOF' character.
Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string. A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.<ref name=Burrows1994/> The algorithms vary somewhat by whether EOF is used, and in which direction the sorting was done. In fact, the original formulation did not use an EOF marker.<ref name=Manzini>{{Cite book |chapter-url=https://people.unipmn.it/manzini/papers/mfcs99x.pdf |archive-url=https://ghostarchive.org/archive/20221009/https://people.unipmn.it/manzini/papers/mfcs99x.pdf |archive-date=2022-10-09 |url-status=live <!-- |url=https://books.google.com/books?id=OcJjpqAi15EC&pg=PA34e --> |title=Mathematical Foundations of Computer Science 1999: 24th International Symposium, MFCS'99 Szklarska Poreba, Poland, September 6-10, 1999 Proceedings |chapter=The Burrows–Wheeler Transform: Theory and Practice |last=Manzini |first=Giovanni |date=1999-08-18 |publisher=Springer Science & Business Media |isbn=9783540664086 |language=en}}</ref> ==Bijective variant== Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the end of the input or doing something equivalent, making it possible to distinguish the input string from all its rotations. Increasing the size of the alphabet (by appending the EOF character) makes later compression steps awkward. There is a [[bijective]] version of the transform, by which the transformed string uniquely identifies the original, and the two have the same length and contain exactly the same characters, just in a different order.<ref>{{citation | last1 = Gil | first1 = J. | last2 = Scott | first2 = D. 
A. | title = A bijective string sorting transform | url = http://bijective.dogma.net/00yyy.pdf | year = 2009 | access-date = 2009-07-09 | archive-date = 2011-10-08 | archive-url = https://web.archive.org/web/20111008001603/http://bijective.dogma.net/00yyy.pdf | url-status = dead }}</ref><ref>{{citation | last = Kufleitner | first = Manfred | editor1-last = Holub | editor1-first = Jan | editor2-last = Žďárek | editor2-first = Jan | arxiv = 0908.0239 | contribution = On bijective variants of the Burrows–Wheeler transform | pages = 65–69 | title = Prague Stringology Conference | url = http://www.stringology.org/event/2009/p07.html | year = 2009| bibcode = 2009arXiv0908.0239K}}.</ref> The bijective transform is computed by factoring the input into a non-increasing sequence of [[Lyndon word]]s; such a factorization exists and is unique by the [[Chen–Fox–Lyndon theorem]],<ref>*{{Citation | last=Lothaire | first=M. | author-link=M. Lothaire | others=Perrin, D.; Reutenauer, C.; Berstel, J.; Pin, J. E.; Pirillo, G.; Foata, D.; Sakarovitch, J.; Simon, I.; Schützenberger, M. P.; Choffrut, C.; Cori, R.; Lyndon, Roger; Rota, Gian-Carlo. Foreword by Roger Lyndon | title=Combinatorics on words | edition=2nd | series=Encyclopedia of Mathematics and Its Applications | volume=17 | publisher=[[Cambridge University Press]] | year=1997 | isbn=978-0-521-59924-5 | zbl=0874.20040 | page=67 }}</ref> and may be found in linear time and constant space.<ref>{{citation | last = Duval | first = Jean-Pierre | doi = 10.1016/0196-6774(83)90017-2 | issue = 4 | journal = Journal of Algorithms | pages = 363–381 | title = Factorizing words over an ordered alphabet | volume = 4 | year = 1983 | zbl=0532.68061| issn=0196-6774}}.</ref> The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of ''n'' strings. The transformed string is then obtained by picking the final character of each string in this sorted list. 
The one important caveat is that strings of different lengths are not compared in the usual way: each of the two strings is repeated forever, and the infinite repetitions are compared. For example, "ORO" precedes "OR" because "OROORO..." precedes "OROROR...".

As a worked example, the text "<span style="color:red;">^</span>BANANA<span style="color:red;">$</span>" is transformed into "ANNBAA<span style="color:red;">^</span><span style="color:red;">$</span>" through the steps below (the red <span style="color:red;">$</span> character indicates the [[End-of-file|EOF]] marker in the original string). The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added in its proper place afterwards. The string is broken into Lyndon words such that the words in the sequence are non-increasing under the comparison method above. (Note that '<span style="color:red;">^</span>' is sorted as succeeding the other characters.) "<span style="color:red;">^</span>BANANA" becomes (<span style="color:red;">^</span>) (B) (AN) (AN) (A).
{| class="wikitable" |- ! colspan="5" | Bijective transformation |- ! Input ! All<br />rotations ! Sorted alphabetically ! Last column<br />of rotated Lyndon word ! Output |- | align=center | <span style="color:red;">^</span>BANANA<span style="color:red;">$</span> | '''<span style="color:red;">^</span>'''<span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span>... (<span style="color:red;">^</span>) '''B'''BBBBBBB... (B) '''ANAN'''ANAN... (AN) '''NANA'''NANA... (NA) '''ANAN'''ANAN... (AN) '''NANA'''NANA... (NA) '''A'''AAAAAAA... (A) | '''A'''AAAAAAA... (A) '''A'''NANANAN... (AN) '''A'''NANANAN... (AN) '''B'''BBBBBBB... (B) '''N'''ANANANA... (NA) '''N'''ANANANA... 
(NA) '''<span style="color:red;">^</span>'''<span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span>... (<span style="color:red;">^</span>) | '''A'''AAAAAAA... ('''A''') A'''N'''ANANAN... (A'''N''') A'''N'''ANANAN... (A'''N''') '''B'''BBBBBBB... ('''B''') N'''A'''NANANA... (N'''A''') N'''A'''NANANA... (N'''A''') '''<span style="color:red;">^</span>'''<span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span>... ('''<span style="color:red;">^</span>''') | ANNBAA<span style="color:red;">^</span><span style="color:red;">$</span> |} {| class="wikitable" ! colspan=4 | Inverse bijective transform |- ! colspan=4 | Input |- | align=center colspan=4 | ANNBAA<span style="color:red;">^</span> |- ! Add 1 ! Sort 1 ! Add 2 ! Sort 2 |- | align=right | A N N B A A <span style="color:red;">^</span> | align=right | A A A B N N <span style="color:red;">^</span> | align=right | AA NA NA BB AN AN <span style="color:red;">^</span><span style="color:red;">^</span> | align=right | AA AN AN BB NA NA <span style="color:red;">^</span><span style="color:red;">^</span> |- ! Add 3 ! Sort 3 ! Add 4 ! 
Sort 4 |- | align=right | AAA NAN NAN BBB ANA ANA <span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span> | align=right | AAA ANA ANA BBB NAN NAN <span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span> | align=right | AAAA NANA NANA BBBB ANAN ANAN <span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span> | align=right | AAAA ANAN ANAN BBBB NANA NANA <span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span><span style="color:red;">^</span> |- ! colspan=4 | Output |- | align=center colspan=4 | <span style="color:red;">^</span>BANANA |} Up until the last step, the process is identical to the inverse Burrows–Wheeler process, but here it will not necessarily give rotations of a single sequence; it instead gives rotations of Lyndon words (which will start to repeat as the process is continued). Here, we can see (repetitions of) four distinct Lyndon words: (A), (AN) (twice), (B), and (<span style="color:red;">^</span>). (NANA... doesn't represent a distinct word, as it is a cycle of ANAN....) At this point, these words are sorted into reverse order: (<span style="color:red;">^</span>), (B), (AN), (AN), (A). These are then concatenated to get :<span style="color:red;">^</span>BANANA The Burrows–Wheeler transform can indeed be viewed as a special case of this bijective transform; instead of the traditional introduction of a new letter from outside our alphabet to denote the end of the string, we can introduce a new letter that compares as preceding all existing letters that is put at the beginning of the string. The whole string is now a Lyndon word, and running it through the bijective process will therefore result in a transformed result that, when inverted, gives back the Lyndon word, with no need for reassembling at the end. 
For example, applying the bijective transform gives: {| class="wikitable" ! Input | <code>SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES</code> |- ! Lyndon words | <code><span style="color:#990000;">S</span><span style="color:#FF9900;">IX</span><span style="color:#006600;">.MIXED.PIXIES.SIFT.SIXTY.PIXIE</span><span style="color:#0000DD;">.DUST</span><span style="color:#660066;">.BOXES</span></code> |- ! Output | <code>STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT</code> |} The bijective transform includes eight runs of identical characters. These runs are, in order: <code>XX</code>, <code>II</code>, <code>XX</code>, <code>PP</code>, <code>..</code>, <code>EE</code>, <code>..</code>, and <code>IIII</code>. In total, 18 characters are used in these runs. ==Dynamic Burrows–Wheeler transform== When a text is edited, its Burrows–Wheeler transform will change. Salson ''et al.''<ref name="Salson2009">{{cite journal |vauthors=Salson M, Lecroq T, Léonard M, Mouchard L |title=A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform |journal=Theoretical Computer Science |year=2009 |doi=10.1016/j.tcs.2009.07.016 |volume=410 |issue=43 |pages=4350–4359|doi-access=free }}</ref> propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly. ==Sample implementation== This [[Python (programming language)|Python]] implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation. It essentially does what the pseudocode section does. 
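
The table of rotations need not be built at all. As a minimal sketch of the suffix-array formulation and the character-at-a-time decoder mentioned in the Optimization section, the following Python code computes the transform from a suffix array and inverts it without reconstructing the table. The names <code>bwt_from_suffix_array</code> and <code>inverse_bwt_lf</code> are hypothetical, a single end marker is assumed, and the suffix array is built by naively sorting suffixes, which is ''not'' the linear-time construction; a practical implementation would substitute a dedicated suffix-array algorithm.

```python
def bwt_from_suffix_array(text: str, eof: str = "\x03") -> str:
    """Forward BWT of text + eof, read off a (naively built) suffix array."""
    assert len(eof) == 1 and eof not in text, "eof must be one unused character"
    t = text + eof
    # Naive suffix array: start positions ordered by the suffix they begin.
    # Assumption: acceptable for a demonstration, but not linear time.
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # BWT[i] = T[SA[i] - 1]; when SA[i] == 0 the index -1 wraps to the eof marker.
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt_lf(last: str, eof: str = "\x03") -> str:
    """Invert the BWT one character at a time, left to right."""
    # A stable sort of positions by character pairs each entry of the sorted
    # first column with the position of the same occurrence in the last column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(eof)  # the row that ends in eof is the original text
    out = []
    for _ in range(len(last) - 1):  # stop before emitting the eof marker
        row = order[row]
        out.append(last[row])
    return "".join(out)
```

For example, <code>bwt_from_suffix_array("BANANA", eof="$")</code> returns <code>"ANNB$AA"</code>, and <code>inverse_bwt_lf("ANNB$AA", eof="$")</code> returns <code>"BANANA"</code>. Here <code>$</code> sorts before the letters because characters are compared by code point, so the result differs from tables in which the marker sorts last.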
The table-based program uses the [[C0 and C1 control codes#STX|STX/ETX control codes]] to mark the start and end of the text, and <code>s[i:] + s[:i]</code> to construct the <code>i</code>th rotation of <code>s</code>. The forward transform takes the last character of each of the sorted rows:
<syntaxhighlight lang="python">
STX, ETX = "\x02", "\x03"  # ASCII start-of-text and end-of-text control codes


def bwt(s: str, start: str = STX, end: str = ETX) -> str:
    r"""
    Apply Burrows–Wheeler transform to input string.

    >>> bwt('BANANA')
    '\x03ANNB\x02AA'
    >>> bwt('BANANA', start='^', end='$')
    'ANNB^AA$'
    >>> bwt('BANANA', start='%', end='$')
    'A$NNB%AA'
    """
    assert (
        start not in s and end not in s
    ), "Input string cannot contain the start or end marker characters"
    s = f"{start}{s}{end}"  # Add start and end of text marker

    # Table of rotations of string
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = [row[-1] for row in table]  # Last character of each row
    return "".join(last_column)  # Convert list of characters into string
</syntaxhighlight>
The inverse transform repeatedly inserts <code>r</code> as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.
<syntaxhighlight lang="python">
def inverse_bwt(r: str, start: str = STX, end: str = ETX) -> str:
    r"""
    Apply inverse Burrows–Wheeler transform.

    >>> inverse_bwt('\x03ANNB\x02AA')
    'BANANA'
    >>> inverse_bwt('ANNB^AA$', start='^', end='$')
    'BANANA'
    >>> inverse_bwt('A$NNB%AA', start='%', end='$')
    'BANANA'
    """
    str_len = len(r)
    table = [""] * str_len  # Make empty table
    for _ in range(str_len):
        table = sorted(rc + tc for rc, tc in zip(r, table))  # Add a column of r

    # Find the row that ends with the end marker (the original, marked string)
    s = next((row for row in table if row.endswith(end)), "")

    # Drop the start and end markers (str.removeprefix needs Python 3.9+)
    return s.removeprefix(start).removesuffix(end)
</syntaxhighlight>
Following implementation notes from Manzini, it is equivalent to use a simple [[null character]] suffix instead. The sorting should be done in [[colexicographic order]] (string read right-to-left), i.e. {{code|2=python|1=sorted(..., key=lambda s: s[::-1])}} in Python.<ref name=Manzini/> (The control codes above do not satisfy the condition of EOF being the last character; the two codes are in fact the ''first''. The rotation holds nevertheless.)

==BWT applications==
Because the Burrows–Wheeler transform is reversible, the original data can be recovered exactly from its output. This lossless property has led to its use in algorithms designed for a variety of purposes, including [[sequence alignment]], [[image compression]], and [[data compression]]. The following sections describe some of these uses.

===BWT for sequence alignment===
The advent of [[next-generation sequencing]] (NGS) techniques at the end of the 2000s led to another application of the Burrows–Wheeler transform. In NGS, [[DNA]] is fragmented into small pieces, of which the first few bases are [[DNA sequencing|sequenced]], yielding several million "reads", each 30 to 500 [[base pair]]s ("DNA characters") long.
In many experiments, e.g., in [[ChIP-Seq]], the task is then to align these reads to a reference [[genome]], i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs specialized for this task were published, which initially relied on [[Hash function|hashing]] (e.g., [[Eland (software)|Eland]], SOAP,<ref name="Li, R2008">{{cite journal |author=Li R |title=SOAP: short oligonucleotide alignment program |journal=Bioinformatics |year=2008 |volume=24 |issue=5 |pages=713–714 |pmid=18227114 |doi=10.1093/bioinformatics/btn025|display-authors=etal|doi-access=free }}</ref> or Maq<ref name="Li, H2008">{{cite journal |vauthors=Li H, Ruan J, Durbin R |title=Mapping short DNA sequencing reads and calling variants using mapping quality scores |journal=Genome Research |volume=18 |issue=11 |pages=1851–1858 |date=2008-08-19 |pmid=18714091 |doi=10.1101/gr.078212.108 |pmc=2577856}}</ref>). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed ([[Bowtie (sequence analysis)|Bowtie]],<ref name="Langmead2009">{{cite journal |vauthors=Langmead B, Trapnell C, Pop M, Salzberg SL |title=Ultrafast and memory-efficient alignment of short DNA sequences to the human genome |journal=Genome Biology |year=2009 |volume=10 |issue=3 |page=R25 |pmid=19261174 |doi=10.1186/gb-2009-10-3-r25 |pmc=2690996 |doi-access=free }}</ref> BWA,<ref name="Li, H2009">{{cite journal |vauthors=Li H, Durbin R |title=Fast and accurate short read alignment with Burrows–Wheeler Transform |journal=Bioinformatics |year=2009 |pmid=19451168 |volume=25 |issue=14 |pages=1754–1760 |doi=10.1093/bioinformatics/btp324 |pmc=2705234}}</ref> and SOAP2<ref name="Li, R2009">{{cite journal |author=Li R |title=SOAP2: an improved ultrafast tool for short read alignment |journal=Bioinformatics |year=2009 |pmid=19497933 |volume=25 |issue=15 |pages=1966–1967 |doi=10.1093/bioinformatics/btp336 |display-authors=etal|doi-access= }}</ref>) that use the Burrows–Wheeler transform.

===BWT for image compression===
The Burrows–Wheeler transformation has proved to be fundamental for [[image compression]] applications. For example, Collin et al.<ref name="Collin, P2019">{{cite book |vauthors=Collin P, Arnavut Z, Koc B |chapter=Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder |chapter-url=https://ieeexplore.ieee.org/document/7319012| title=2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)|journal=<!-- --> |date=2015 |volume=2015 |pages=2956–2959 |pmid=26736912 |doi=10.1109/EMBC.2015.7319012 |isbn=978-1-4244-9271-8 |s2cid=4460328 }}</ref> showed a compression pipeline based on the application of the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders. The pipeline developed in this case is known as Burrows–Wheeler transform with an inversion encoder (BWIC). BWIC was shown to outperform well-known and widely used algorithms such as [[Lossless JPEG]] and [[JPEG 2000]] in terms of the final compressed size of radiography medical images, on the order of 5.1% and 4.1% respectively. The improvements are achieved by combining BWIC with a pre-BWIC scan of the image in vertical snake order. More recently, additional work has shown that the Burrows–Wheeler transform, used in conjunction with the [[move-to-front transform]] (MTF), can achieve near-lossless compression of images.
<ref name="Devadoss, CP2019">{{cite journal |vauthors=Devadoss CP, Sankaragomathi B |title=Near lossless medical image compression using block BWT–MTF and hybrid fractal compression techniques |url=https://link.springer.com/article/10.1007/s10586-018-1801-3| journal=Cluster Computing| date=2019| volume=22 |pages=12929–12937 |doi=10.1007/s10586-018-1801-3|s2cid=33687086 }}</ref>

===BWT for compression of genomic databases===
Cox et al.<ref name="Cox, AJ2012">{{cite journal |vauthors= Cox AJ, Bauer MJ, Jakobi T, Rosone G|title=Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform |journal=Bioinformatics |volume=28 |number=11 |pages=1415–1419 |year=2012 | publisher=Oxford University Press |doi=10.1093/bioinformatics/bts173 |pmid=22556365 |arxiv=1205.0192 }}</ref> presented a genomic compression scheme that uses BWT as the first stage of compression for several genomic datasets, including human genomic data. Their work proposed that BWT compression could be enhanced by adding a second-stage compression mechanism called same-as-previous encoding ("SAP"), which makes use of the fact that suffixes of two or more prefix letters could be equal. Using the combined BWT-SAP scheme, Cox et al. showed that the genomic database ERA015743, 135.5 GB in size, could be compressed by around 94%, to 8.2 GB.

===BWT for sequence prediction===
BWT has also proved useful for sequence prediction, a common area of study in [[machine learning]] and [[natural-language processing]].
In particular, Ktistakis et al.<ref name="Ktistakis, R2019">{{cite book |vauthors= Ktistakis R, Fournier-Viger P, Puglisi SJ, Raman R|chapter=Succinct BWT-Based Sequence Prediction |chapter-url=https://figshare.com/articles/conference_contribution/Succinct_BWT-based_Sequence_prediction/10200137| title= Database and Expert Systems Applications |series=Lecture Notes in Computer Science |date= 2019 |volume= 11707 |issue=10 |pages= 91–101 | doi=10.1007/978-3-030-27618-8_7|isbn=978-3-030-27617-1 |s2cid=201058996 }}</ref> proposed a sequence prediction scheme called SuBSeq that exploits the lossless data compression of the Burrows–Wheeler transform. SuBSeq uses the BWT by extracting the [[FM-index]] and then performing a series of operations called backwardSearch, forwardSearch, neighbourExpansion, and getConsequents in order to search for predictions given a [[suffix]]. The predictions are then classified based on a weight and put into an array, from which the element with the highest weight is returned as the prediction of the SuBSeq algorithm. SuBSeq has been shown to outperform [[state of the art|state-of-the-art]] algorithms for sequence prediction in terms of both training time and accuracy.
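
As an illustration of the kind of backward search over an [[FM-index]] named above, the following minimal sketch counts the occurrences of a pattern using only the BWT of a text. It is not the SuBSeq implementation: the function names <code>bwt_sentinel</code>, <code>build_fm_index</code> and <code>count_occurrences</code> are hypothetical, the text is assumed not to contain the null character used as sentinel, and the occurrence table is stored in full rather than sampled as a practical FM-index would do.

<syntaxhighlight lang="python">
def bwt_sentinel(text: str, sentinel: str = "\0") -> str:
    # Naive BWT of text + sentinel, built from the sorted rotations.
    # Assumes the sentinel does not occur in text and sorts before
    # every other character.
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def build_fm_index(last: str):
    # first_pos[c]: number of characters in `last` that are smaller than c,
    # i.e. the index of the first sorted row that starts with c.
    first_pos = {}
    total = 0
    for c in sorted(set(last)):
        first_pos[c] = total
        total += last.count(c)
    # occ[c][i]: number of occurrences of c in last[:i].
    occ = {c: [0] * (len(last) + 1) for c in first_pos}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first_pos, occ


def count_occurrences(pattern: str, first_pos, occ, n: int) -> int:
    # Backward search: process the pattern right to left, narrowing the
    # half-open range [lo, hi) of sorted rows that start with the
    # part of the pattern seen so far.
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + occ[c][lo]
        hi = first_pos[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


last = bwt_sentinel("BANANA")
first_pos, occ = build_fm_index(last)
print(count_occurrences("ANA", first_pos, occ, len(last)))  # prints 2
</syntaxhighlight>

Each step of the loop costs two table lookups, so the count is obtained in time proportional to the pattern length, independent of the text length; the full occurrence table used here trades memory for simplicity.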
==References==
{{Reflist}}

==External links==
* [http://marknelson.us/1996/09/01/bwt/ Article by Mark Nelson on the BWT] {{Webarchive|url=https://web.archive.org/web/20170325024404/http://marknelson.us/1996/09/01/bwt/ |date=2017-03-25 }}
* [http://bijective.dogma.net/00yyy.pdf A Bijective String-Sorting Transform, by Gil and Scott] {{Webarchive|url=https://web.archive.org/web/20111008001603/http://bijective.dogma.net/00yyy.pdf |date=2011-10-08 }}
* [https://web.archive.org/web/20170306035431/https://encode.ru/attachment.php?attachmentid=959&d=1249146089 Yuta's openbwt-v1.5.zip contains source code for various BWT routines including BWTS for bijective version]
* [https://arxiv.org/abs/0908.0239 On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner]
* [http://google-opensource.blogspot.com/2008/06/debuting-dcs-bwt-experimental-burrows.html Blog post] and [https://code.google.com/p/dcs-bwt-compressor/ project page] for an open-source compression program and library based on the Burrows–Wheeler algorithm
* [https://www.youtube.com/watch?v=P3ORBMon8aw MIT open courseware lecture on BWT (Foundations of Computational and Systems Biology)]
* [https://github.com/abderrahimh/ARahim League Table Sort (LTS) or The Weighting algorithm to BWT by Abderrahim Hechachena]

{{Compression Methods}}

{{DEFAULTSORT:Burrows-Wheeler Transform}}
[[Category:Lossless compression algorithms]]
[[Category:Data compression transforms]]
[[Category:Articles with example pseudocode]]
[[Category:Articles with example Python (programming language) code]]
[[Category:Articles with example R code]]
[[Category:Data compression]]