
== Stream format ==
A Deflate stream consists of a series of blocks. Each block is preceded by a 3-[[bit]] header:
* First bit: Last-block-in-stream marker:
** <code>1</code>: This is the last block in the stream.
** <code>0</code>: There are more blocks to process after this one.
* Second and third bits: Encoding method used for this block:
** <code>00</code>: A stored (a.k.a. raw or literal) section, between 0 and 65,535 bytes in length
** <code>01</code>: A ''static Huffman'' compressed block, using a pre-agreed [[Huffman coding|Huffman tree]] defined in the RFC
** <code>10</code>: A ''dynamic Huffman'' compressed block, complete with the Huffman table supplied
** <code>11</code>: Reserved; must not be used
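
As a rough illustration, the header of the first block can be read from the first byte of a raw Deflate stream. The sketch assumes the bit order given in RFC 1951 (bits are packed starting from the least-significant bit of each byte) and that the block begins on a byte boundary, which is only guaranteed for the first block. The name <code>parse_block_header</code> is a hypothetical helper, not part of any library.

```python
# Block types indexed by the two-bit encoding-method field
BLOCK_TYPES = {
    0b00: "stored",
    0b01: "static Huffman",
    0b10: "dynamic Huffman",
    0b11: "reserved",
}

def parse_block_header(first_byte):
    """Return (is_last, block_type) for a header starting at bit 0 of first_byte."""
    is_last = bool(first_byte & 1)       # first bit: last-block marker
    method = (first_byte >> 1) & 0b11    # second and third bits: encoding method
    return is_last, BLOCK_TYPES[method]

print(parse_block_header(0x01))  # (True, 'stored')
print(parse_block_header(0x04))  # (False, 'dynamic Huffman')
```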

The ''stored'' block option adds minimal overhead and is used for data that is incompressible.
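
A stored block can be assembled by hand, which makes the overhead visible. The sketch below assumes the layout described in RFC 1951: after the 3-bit header the stream is padded to a byte boundary, followed by a 16-bit little-endian length, its one's complement, and the raw bytes. The name <code>make_stored_block</code> is a hypothetical helper; Python's <code>zlib</code> module is used only to check the result.

```python
import struct
import zlib

def make_stored_block(data, last=True):
    """Wrap data in a single stored (uncompressed) Deflate block."""
    if len(data) > 0xFFFF:
        raise ValueError("a stored block holds at most 65,535 bytes")
    # Header byte: last-block bit, method 00, then five padding bits
    header = bytes([1 if last else 0])
    length = len(data)
    # Length and its one's complement, both 16-bit little-endian
    return header + struct.pack("<HH", length, length ^ 0xFFFF) + data

block = make_stored_block(b"hello")
# A window-bits value of -15 asks zlib for a raw Deflate stream (no wrapper)
assert zlib.decompress(block, -15) == b"hello"
print(len(block) - len(b"hello"))  # 5 bytes of overhead
```

Under these assumptions a byte-aligned stored block costs five bytes on top of its payload.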

Most compressible data is encoded using method <code>10</code>, the ''dynamic Huffman'' encoding, which produces an optimized Huffman tree customized for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the percentage compression loss due to using a non-optimal (thus, not technically Huffman) code.

Compression is achieved through two steps:
* Matching and replacing duplicate strings with pointers
* Replacing symbols with new, weighted symbols based on use frequency

=== Duplicate string elimination ===
{{Further|LZ77 and LZ78|LZSS}}
Within compressed blocks, if a series of bytes repeats an earlier one (a repeated string), a back-[[Reference (computer science)|reference]] to the earlier occurrence is inserted in place of the repeated bytes. An encoded match to an earlier string consists of an [[8-bit computing|8-bit]] length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the start of the duplicate. Relative back-references can be made across any number of blocks, as long as the referenced data lies within the last 32&nbsp;[[Kibibyte|KiB]] of uncompressed data decoded (termed the ''sliding window'').

If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of 10 identical bytes can be encoded as one byte, followed by a duplicate of length 9, starting with the prior byte.
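
The overlapping case works because a decoder can copy one byte at a time, so bytes written earlier in the same copy become available as source bytes. A minimal sketch of this, where <code>copy_match</code> is a hypothetical helper operating on an in-memory output buffer:

```python
def copy_match(out, distance, length):
    """Append `length` bytes to `out`, copied from `distance` bytes back."""
    if not 1 <= distance <= len(out):
        raise ValueError("distance points before the start of the output")
    start = len(out) - distance
    for i in range(length):
        # Byte-by-byte copy: when distance < length, this re-reads bytes
        # that were appended earlier in this same loop
        out.append(out[start + i])

out = bytearray(b"a")                   # one literal byte
copy_match(out, distance=1, length=9)   # duplicate of length 9, one byte back
print(out)  # bytearray(b'aaaaaaaaaa')
```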

Searching the preceding text for duplicate substrings is the most computationally expensive part of the Deflate algorithm, and it is the operation that compression-level settings affect.

=== Bit reduction ===
{{Further|Huffman coding}}
The second compression stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is [[Huffman coding]], which creates a prefix-free code (no code word is a prefix of another) in which the length of each code word is approximately proportional to the negative logarithm of the probability of that symbol needing to be encoded. The more likely it is that a symbol has to be encoded, the shorter its bit-sequence will be.
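
The relationship between frequency and code length can be seen in a small sketch of the textbook Huffman construction. This is an illustration only, not the procedure of any particular Deflate implementation: <code>huffman_code_lengths</code> is a hypothetical helper, and unlike a real Deflate encoder it does not enforce the 15-bit limit that RFC 1951 places on code lengths.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for the symbols in data."""
    freq = Counter(data)
    if len(freq) == 1:
        return {symbol: 1 for symbol in freq}
    # Heap entries: (weight, tie-breaker, symbols in this subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        for symbol in group1 + group2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, tie, group1 + group2))
        tie += 1
    return lengths

print(huffman_code_lengths("aaaabbc"))  # {'a': 1, 'b': 2, 'c': 2}
```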

A tree is created, containing space for 288 symbols:
* 0–255: represent the literal bytes/symbols 0–255.
* 256: end of block – stop processing if last block, otherwise start processing next block.
* 257–285: combined with extra bits, represent a match length of 3–258 bytes.
* 286, 287: not used, reserved and illegal but still part of the tree.
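
How a length symbol combines with its extra bits can be sketched as follows. The base lengths and extra-bit counts reproduce the table in RFC 1951 (section 3.2.5), which is not listed in this article; the closed-form expression and the name <code>length_code_info</code> are assumptions made for illustration, and a real decoder would more likely use a lookup table.

```python
def length_code_info(symbol):
    """Return (base_length, extra_bits) for a length symbol 257-285."""
    if not 257 <= symbol <= 285:
        raise ValueError("not a length symbol")
    if symbol <= 264:
        return symbol - 254, 0      # lengths 3-10, no extra bits
    if symbol == 285:
        return 258, 0               # maximum length has its own symbol
    k = symbol - 261                # 4..23 for symbols 265..284
    extra = k // 4                  # 1..5 extra bits, four symbols per step
    base = ((4 + k % 4) << extra) + 3
    return base, extra

def match_length(symbol, extra_value):
    """Combine a length symbol with the value of its extra bits."""
    base, extra = length_code_info(symbol)
    if not 0 <= extra_value < (1 << extra):
        raise ValueError("extra value does not fit in the extra bits")
    return base + extra_value

print(length_code_info(265))   # (11, 1)
print(match_length(269, 3))    # 22
```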

A match length code will always be followed by a distance code. Based on the distance code read, further "extra" bits may be read in order to produce the final distance. The distance tree contains space for 32 symbols:
* 0–3: distances 1–4
* 4–5: distances 5–8, 1 extra bit
* 6–7: distances 9–16, 2 extra bits
* 8–9: distances 17–32, 3 extra bits
* ...
* 26–27: distances 8,193–16,384, 12 extra bits
* 28–29: distances 16,385–32,768, 13 extra bits
* 30–31: not used, reserved and illegal but still part of the tree

For the match distance symbols 2–29, the number of extra bits for symbol number <math>n</math> can be calculated as <math>\left\lfloor\frac{n}{2}\right\rfloor-1</math>.
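
The formula can be checked against the ranges listed above with a short sketch. The expression for the base distance is derived here from those ranges for illustration and is not quoted from the RFC; <code>distance_code_info</code> is a hypothetical helper name.

```python
def distance_code_info(symbol):
    """Return (base_distance, extra_bits) for a distance symbol 0-29."""
    if not 0 <= symbol <= 29:
        raise ValueError("symbols 30 and 31 are reserved")
    if symbol < 2:
        return symbol + 1, 0              # distances 1 and 2
    extra = symbol // 2 - 1               # floor(n/2) - 1
    base = ((2 + symbol % 2) << extra) + 1
    return base, extra

for n in (3, 4, 9, 28, 29):
    base, extra = distance_code_info(n)
    # Each symbol covers 2**extra consecutive distances starting at base
    print(n, base, base + (1 << extra) - 1, extra)
# 3 4 4 0
# 4 5 6 1
# 9 25 32 3
# 28 16385 24576 13
# 29 24577 32768 13
```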

The two codes (the 288-symbol length/literal tree and the 32-symbol distance tree) are themselves encoded as [[canonical Huffman code]]s by giving the bit length of the code for each symbol. The bit lengths are themselves [[Run-length encoding|run-length encoded]] to produce as compact a representation as possible. As an alternative to including the tree representation, the "static tree" option provides standard fixed Huffman trees. The compressed size using the static trees can be computed using the same statistics (the number of times each symbol appears) as are used to generate the dynamic trees, so it is easy for a compressor to choose whichever is smaller.
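
Because the codes are canonical, the bit lengths alone are enough to rebuild every code word. The sketch below follows the procedure described in RFC 1951 (section 3.2.2): count the codes of each length, compute the first code of each length, then assign consecutive values in symbol order. The name <code>canonical_codes</code> is a hypothetical helper, and codes are returned as bit strings for readability.

```python
def canonical_codes(lengths):
    """Map symbol index -> code bit string, given code lengths (0 = unused)."""
    max_len = max(lengths)
    # Number of codes of each bit length
    count = [0] * (max_len + 1)
    for length in lengths:
        if length:
            count[length] += 1
    # Smallest code value for each bit length
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code
    # Assign consecutive codes to symbols of the same length, in symbol order
    codes = {}
    for symbol, length in enumerate(lengths):
        if length:
            codes[symbol] = format(next_code[length], "0{}b".format(length))
            next_code[length] += 1
    return codes

# Worked example from RFC 1951: symbols A-H with these code lengths
print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
# {0: '010', 1: '011', 2: '100', 3: '101', 4: '110', 5: '00', 6: '1110', 7: '1111'}
```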