Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Bzip2
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== File format == No formal specification for bzip2 exists, although an informal specification has been reverse engineered from the reference implementation.<ref>{{cite web | url=https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf | title=BZIP2 Format Specification | website=[[GitHub]] | date=17 March 2022 }}</ref> As an overview, a <code>.bz2</code> stream consists of a 4-byte header, followed by zero or more compressed blocks, immediately followed by an end-of-stream marker containing a 32-bit CRC for the plaintext whole stream processed. The compressed blocks are bit-aligned and no padding occurs. <!-- /* Paul Sladen, 2007-01-11 */ --><pre> .magic:16 = 'BZ' signature/magic number .version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated) .hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB (uncompressed) .compressed_magic:48 = 0x314159265359 (BCD (pi)) .crc:32 = checksum for this block .randomised:1 = 0=>normal, 1=>randomised (deprecated) .origPtr:24 = starting pointer into BWT for after untransform .huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present .huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16) .huffman_groups:3 = 2..6 number of different Huffman tables in use .selectors_used:15 = number of times that the Huffman tables are swapped (each 50 symbols) *.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used) .start_huffman_length:5 = 0..20 starting bit length for Huffman deltas *.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length { 1=>decrement length; 0=>increment length } (*(symbols+2)*groups) .contents:2..β = Huffman encoded data stream until end of block (max. 7372800 bit) .eos_magic:48 = 0x177245385090 (BCD sqrt(pi)) .crc:32 = checksum for whole stream .padding:0..7 = align to whole byte </pre> Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,236 bytes). This can occur if the whole plaintext consists entirely of repeated values (the resulting <code>.bz2</code> file in this case is 46 bytes long). An even smaller file of 40 bytes can be achieved by using an input containing entirely values of 251, an apparent compression ratio of 1147480.9:1. A compressed block in bzip2 can be decompressed without having to process earlier blocks. This means that bzip2 files can be decompressed in parallel, making it a good format for use in [[big data]] applications with cluster computing frameworks like [[Hadoop]] and [[Apache Spark]].<ref>{{cite web | url=https://issues.apache.org/jira/browse/HADOOP-4012 | title=[HADOOP-4012] Providing splitting support for bzip2 compressed files | work=[[Apache Software Foundation]] | date=2009 | access-date=2015-10-14 }}</ref>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)