Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Deflate
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Stream format == A Deflate stream consists of a series of blocks. Each block is preceded by a 3-[[bit]] header: * First bit: Last-block-in-stream marker: ** <code>1</code>: This is the last block in the stream. ** <code>0</code>: There are more blocks to process after this one. * Second and third bits: Encoding method used for this block type: ** <code>00</code>: A stored (a.k.a. raw or literal) section, between 0 and 65,535 bytes in length ** <code>01</code>: A ''static Huffman'' compressed block, using a pre-agreed [[Huffman coding|Huffman tree]] defined in the RFC ** <code>10</code>: A ''dynamic Huffman'' compressed block, complete with the Huffman table supplied ** <code>11</code>: Reserved: don't use The ''stored'' block option adds minimal overhead and is used for data that is incompressible. Most compressible data will end up being encoded using method <code>10</code>, the ''dynamic Huffman'' encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code. Compression is achieved through two steps: * Matching and replacing duplicate strings with pointers * Replacing symbols with new, weighted symbols based on use frequency === Duplicate string elimination === {{Further|LZ77 and LZ78|LZSS}} Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-[[Reference (computer science)|reference]] is inserted, linking to the prior location of that identical string instead. An encoded match to an earlier string consists of an [[8-bit computing|8-bit]] length (3β258 bytes) and a 15-bit distance (1β32,768 bytes) to the start of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 [[Kibibyte|KiB]] of uncompressed data decoded (termed the ''sliding window''). If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of 10 identical bytes can be encoded as one byte, followed by a duplicate of length 9, starting with the prior byte. Searching the preceding text for duplicate substrings is the most computationally expensive part of the Deflate algorithm, and the operation which compression level settings affect. === Bit reduction === {{Further|Huffman coding}} The second compression stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is [[Huffman coding]] which creates an unprefixed tree of non-overlapping intervals, where the length of each sequence is inversely proportional to the logarithm of the probability of that symbol needing to be encoded. The more likely it is that a symbol has to be encoded, the shorter its bit-sequence will be. A tree is created, containing space for 288 symbols: * 0β255: represent the literal bytes/symbols 0β255. * 256: end of block β stop processing if last block, otherwise start processing next block. * 257β285: combined with extra-bits, a match length of 3β258 bytes. * 286, 287: not used, reserved and illegal but still part of the tree. A match length code will always be followed by a distance code. Based on the distance code read, further "extra" bits may be read in order to produce the final distance. The distance tree contains space for 32 symbols: * 0β3: distances 1β4 * 4β5: distances 5β8, 1 extra bit * 6β7: distances 9β16, 2 extra bits * 8β9: distances 17β32, 3 extra bits * ... * 26β27: distances 8,193β16,384, 12 extra bits * 28β29: distances 16,385β32,768, 13 extra bits * 30β31: not used, reserved and illegal but still part of the tree For the match distance symbols 2β29, the number of extra bits can be calculated as <math>\left\lfloor\frac{n}{2}\right\rfloor-1</math>. The two codes (the 288-symbol length/literal tree and the 32-symbol distance tree) are themselves encoded as [[canonical Huffman code]]s by giving the bit length of the code for each symbol. The bit lengths are themselves [[Run-length encoding|run-length encoded]] to produce as compact a representation as possible. As an alternative to including the tree representation, the "static tree" option provides standard fixed Huffman trees. The compressed size using the static trees can be computed using the same statistics (the number of times each symbol appears) as are used to generate the dynamic trees, so it is easy for a compressor to choose whichever is smaller.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)