Editing LZMA (section)

== Decompression algorithm details ==
{{Original research|section|date=April 2012}}
No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text.

The description below is based on the compact [[XZ Utils|XZ]] Embedded decoder by Lasse Collin included in the Linux kernel source<ref>{{cite web
| url=https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/lib/xz/xz_dec_lzma2.c
| title=lib/xz/xz_dec_lzma2.c
| first1=Lasse
| last1=Collin
| first2=Igor
| last2=Pavlov
| access-date=2013-06-16}}</ref> from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference is not ideal, any programmer should be able to check the claims below with a few hours of work.

===Range coding of bits===

LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.

Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable ''prob'' (typically implemented using a 16-bit data type) representing the predicted probability of the bit being 0, which is read and updated by the range decoder (and should be initialized to {{tmath|2^{10} }}, representing 0.5 probability).

Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.

The range decoder state consists of two unsigned 32-bit variables, ''range'' (representing the range size), and ''code'' (representing the encoded point within the range).

Initialization of the range decoder consists of setting ''range'' to {{math|2<sup>32</sup> − 1}}, and ''code'' to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.

Normalization proceeds in this way:
# Shift both ''range'' and ''code'' left by 8 bits
# Read a byte from the compressed stream
# Set the least significant 8 bits of ''code'' to the byte value read

Context-based range decoding of a bit using the ''prob'' probability variable proceeds in this way:
# If ''range'' is less than {{tmath|2^{24} }}, perform normalization
# Set ''bound'' to {{tmath|\lfloor range / 2^{11} \rfloor \times prob}}
# If ''code'' is less than ''bound'':
## Set ''range'' to ''bound''
## Set ''prob'' to ''prob'' + {{tmath|\lfloor 2^{11} - prob \rfloor / 2^5}}
## Return bit 0
# Otherwise (if ''code'' is greater than or equal to the ''bound''):
## Set ''range'' to ''range'' &minus; ''bound''
## Set ''code'' to ''code'' &minus; ''bound''
## Set ''prob'' to {{tmath|prob - \lfloor prob / 2^5 \rfloor }}
## Return bit 1

Fixed-probability range decoding of a bit proceeds in this way:
# If ''range'' is less than {{tmath|2^{24} }}, perform normalization
# Set ''range'' to {{tmath|\lfloor range / 2 \rfloor}}
# If ''code'' is less than ''range'':
## Return bit 0
# Otherwise (if ''code'' is greater or equal than ''range''):
## Set ''code'' to ''code'' &minus; ''range''
## Return bit 1

The Linux kernel implementation of fixed-probability decoding in <code>rc_direct()</code>, for performance reasons, does not include a conditional branch, but instead subtracts ''range'' from ''code'' unconditionally. The resulting sign bit is used to both decide the bit to return and to generate a mask that is combined with ''code'' and added to ''range''.

Note that:
# The division by {{tmath|2^{11} }} when computing ''bound'' and floor operation is done before the multiplication, not after (apparently to avoid requiring fast hardware support for 32-bit multiplication with a 64-bit result)
# Fixed probability decoding is not strictly equivalent to context-based range decoding with any ''prob'' value, due to the fact that context-based range decoding discards the lower 11 bits of ''range'' before multiplying by ''prob'' as just described, while fixed probability decoding only discards the last bit

===Range coding of integers ===

The range decoder also provides the bit-tree, reverse bit-tree and fixed probability integer decoding facilities, which are used to decode integers, and generalize the single-bit decoding described above.
To decode unsigned integers less than ''limit'', an array of {{math|(''limit'' − 1)}} 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with ''limit'' leaves.

Non-reverse bit-tree decoding works by keeping a pointer to the tree of variables, which starts at the root. As long as the pointer does not point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right children depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned.

Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in the valid range is possible (this conceptually allows to have range sizes that are not powers of two, even though LZMA does not make use of this).

Reverse bit-tree decoding instead decodes from least significant bit to most significant bits, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bittree decoding with a power of two ''limit'', and reversing the last {{math|log{{sub|2}}(''limit'')}} bits of the result.

In the {{mono|rc_bittree}} function in the Linux kernel, integers are actually returned in the {{math|[''limit'', 2 × ''limit'')}} range (with ''limit'' added to the conceptual value), and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2''i'' and 2''i'' + 1. The {{mono|rc_bittree_reverse}} function instead adds integers in the {{math|[0, ''limit'')}} range to a caller-provided variable, where ''limit'' is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons.

Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant.

=== LZMA configuration ===
The LZMA decoder is configured by an {{mono|''lclppb''}} "properties" byte and a dictionary size. The value of the {{mono|''lclppb''}} byte is <code>''lc'' + ''lp'' * 9 + ''pb'' * 9 * 5</code>, where:
* {{mono|''lc''}} is the number of high bits of the previous byte to use as a context for literal encoding (the default value used by the LZMA SDK is 3)
* {{mono|''lp''}} is the number of low bits of the dictionary position to include in {{mono|''literal_pos_state''}} (the default value used by the LZMA SDK is 0)
* {{mono|''pb''}} is the number of low bits of the dictionary position to include in {{mono|''pos_state''}} (the default value used by the LZMA SDK is 2)

In non-LZMA2 streams, {{mono|''lc''}} must not be greater than 8, and {{mono|''lp''}} and {{mono|''pb''}} must not be greater than 4.
In LZMA2 streams, <code>''lc'' + ''lp''</code> and {{mono|''pb''}} must not be greater than 4.

In the 7-zip LZMA file format, configuration is performed by a header containing the "properties" byte followed by the 32-bit little-endian dictionary size in bytes. In LZMA2, the properties byte can optionally be changed at the start of LZMA2 LZMA packets, while the dictionary size is specified in the LZMA2 header as later described.

=== LZMA coding contexts ===

The LZMA packet format has already been described, and this section specifies how LZMA statistically models the LZ-encoded streams, or in other words which probability variables are passed to the range decoder to decode each bit.

Those probability variables are implemented as multi-dimensional arrays; before introducing them, a few values that are used as indices in these multidimensional arrays are defined.

The {{mono|''state''}} value is conceptually based on which of the patterns in the following table match the latest 2–4 packet types seen, and is implemented as a state machine state updated according to the transition table listed in the table every time a packet is output.

The initial state is 0, and thus packets before the beginning are assumed to be LIT packets.
{| class="wikitable" border="1"
|-
! state
! colspan="4" | previous packets
! colspan="4" | next state when next packet is
|-
!
! 4th previous
! 3rd previous
! 2nd previous
! previous
! LIT
! MATCH
! LONGREP[*]
! SHORTREP
|-
| 0
|
| LIT
| LIT
| LIT
| 0
| 7
| 8
| 9
|-
| 1
|
| MATCH
| LIT
| LIT
| 0
| 7
| 8
| 9
|-
| rowspan="2" | 2
|
| LONGREP[*]
| rowspan="2" | LIT
| rowspan="2" | LIT
| rowspan="2" | 0
| rowspan="2" | 7
| rowspan="2" | 8
| rowspan="2" | 9
|-
| *MATCH
| SHORTREP
|-
| 3
| LIT
| SHORTREP
| LIT
| LIT
| 0
| 7
| 8
| 9
|-
| 4
|
|
| MATCH
| LIT
| 1
| 7
| 8
| 9
|-
| rowspan="2" | 5
| rowspan="2" |
|
| LONGREP[*]
| rowspan="2" | LIT
| rowspan="2" | 2
| rowspan="2" | 7
| rowspan="2" | 8
| rowspan="2" | 9
|-
| *MATCH
| SHORTREP
|-
| 6
|
| LIT
| SHORTREP
| LIT
| 3
| 7
| 8
| 9
|-
| 7
|
|
| LIT
| MATCH
| 4
| 10
| 11
| 11
|-
| 8
|
|
| LIT
| LONGREP[*]
| 5
| 10
| 11
| 11
|-
| 9
|
|
| LIT
| SHORTREP
| 6
| 10
| 11
| 11
|-
| 10
|
|
| *MATCH
| MATCH
| 4
| 10
| 11
| 11
|-
| 11
|
|
| *MATCH
| *REP
| 5
| 10
| 11
| 11
|}

The {{mono|''pos_state''}} and {{mono|''literal_pos_state''}} values consist of respectively the {{mono|''pb''}} and {{mono|''lp''}} (up to 4, from the LZMA header or LZMA2 properties packet) least significant bits of the dictionary position (the number of bytes coded since the last dictionary reset modulo the dictionary size). Note that the dictionary size is normally the multiple of a large power of 2, so these values are equivalently described as the least significant bits of the number of uncompressed bytes seen since the last dictionary reset.

The {{mono|''prev_byte_lc_msbs''}} value is set to the {{mono|''lc''}} (up to 4, from the LZMA header or LZMA2 properties packet) most significant bits of the previous uncompressed byte.

The {{mono|''is_REP''}} value denotes whether a packet that includes a length is a LONGREP rather than a MATCH.

The {{mono|''match_byte''}} value is the byte that would have been decoded if a SHORTREP packet had been used (in other words, the byte found at the dictionary at the last used distance); it is only used just after a *MATCH packet.

{{mono|''literal_bit_mode''}} is an array of 8 values in the 0–2 range, one for each bit position in a byte, which are 1 or 2 if the previous packet was a *MATCH and it is either the most significant bit position or all the more significant bits in the literal to encode/decode are equal to the bits in the corresponding positions in {{mono|''match_byte''}}, while otherwise it is 0; the choice between the 1 or 2 values depends on the value of the bit at the same position in {{mono|''match_byte''}}.

The literal/Literal set of variables can be seen as a "pseudo-bit-tree" similar to a bit-tree but with 3 variables instead of 1 in every node, chosen depending on the {{mono|''literal_bit_mode''}} value at the bit position of the next bit to decode after the bit-tree context denoted by the node.

The claim, found in some sources, that literals after a *MATCH are coded as the XOR of the byte value with {{mono|''match_byte''}} is incorrect; they are instead coded simply as their byte value, but using the pseudo-bit-tree just described and the additional context listed in the table below.

The probability variable groups used in LZMA are those:
{| class="wikitable" border="1"
|-
! XZ name
! LZMA SDK name
! Parameterized by
! Used when
! Coding mode
! If bit 0 then
! If bit 1 then
|-
| {{mono|is_match}}
| {{mono|IsMatch}}
| {{mono|''state''}}, {{mono|''pos_state''}}
| packet start
| bit
| LIT
| *MATCH
|-
| {{mono|is_rep}}
| {{mono|IsRep}}
| {{mono|''state''}}
| after bit sequence 1
| bit
| MATCH
| *REP
|-
| {{mono|is_rep0}}
| {{mono|IsRepG0}}
| {{mono|''state''}}
| after bit sequence 11
| bit
| SHORTREP/{{nowrap|LONGREP[0]}}
| {{nowrap|LONGREP[1–3]}}
|-
| {{mono|is_rep0_long}}
| {{mono|IsRep0Long}}
| {{mono|''state''}}, {{mono|''pos_state''}}
| after bit sequence 110
| bit
| SHORTREP
| LONGREP[0]
|-
| {{mono|is_rep1}}
| {{mono|IsRepG1}}
| {{mono|''state''}}
| after bit sequence 111
| bit
| LONGREP[1]
| LONGREP[2/3]
|-
| {{mono|is_rep2}}
| {{mono|IsRepG2}}
| {{mono|''state''}}
| after bit sequence 1111
| bit
| LONGREP[2]
| LONGREP[3]
|-
| {{mono|literal}}
| {{mono|Literal}}
| {{mono|''prev_byte_lc_msbs''}}, {{mono|''literal_pos_state''}}, {{nowrap|<code>''literal_bit_mode''[bit position]</code>}}, bit-tree context
| after bit sequence 0
| 256 values pseudo-bit-tree
| colspan="2" | literal byte value
|-
| {{mono|dist_slot}}
| {{mono|PosSlot}}
| <code>min(''match_length'', 5)</code>, bit-tree context
| distance: start
| 64 values bit-tree
| colspan="2" | distance slot
|-
| {{mono|dist_special}}
| {{mono|SpecPos}}
| {{mono|''distance_slot''}}, reverse bit-tree context
| distance: 4–13 distance slots
| {{nowrap|<code>((''distance_slot'' >> 1) − 1)</code>}}-bit reverse bit-tree
| colspan="2" | low bits of distance
|-
| {{mono|dist_align}}
| {{mono|Align}}
| reverse bit-tree context
| distance: 14+ distance slots, after fixed probability bits
| 4-bit reverse bit-tree
| colspan="2" | low bits of distance
|-
| {{mono|len_dec.choice}}
| {{mono|LenChoice}}
| {{mono|''is_REP''}}
| match length: start
| bit
| 2–9 length
| 10+ length
|-
| {{mono|len_dec.choice2}}
| {{mono|LenChoice2}}
| {{mono|''is_REP''}}
| match length: after bit sequence 1
| bit
| 10–17 length
| 18+ length
|-
| {{mono|len_dec.low}}
| {{mono|LenLow}}
| {{mono|''is_REP''}}, {{mono|''pos_state''}}, bit-tree context
| match length: after bit sequence 0
| 8 values bit-tree
| colspan="2" | low bits of length
|-
| {{mono|len_dec.mid}}
| {{mono|LenMid}}
| {{mono|''is_REP''}}, {{mono|''pos_state''}}, bit-tree context
| match length: after bit sequence 10
| 8 values bit-tree
| colspan="2" | middle bits of length
|-
| {{mono|len_dec.high}}
| {{mono|LenHigh}}
| {{mono|''is_REP''}}, bit-tree context
| match length: after bit sequence 11
| 256 values bit-tree
| colspan="2" | high bits of length
|}

=== LZMA2 format ===
The LZMA2 container supports multiple runs of compressed LZMA data and uncompressed data. Each LZMA compressed run can have a different LZMA configuration and dictionary. This improves the compression of partially or completely incompressible files and allows multithreaded compression and multithreaded decompression by breaking the file into runs that can be compressed or decompressed independently in parallel.
Criticism of changes in LZMA2 over LZMA include header fields not being covered by CRCs,
and parallel decompression not being possible in practice.<ref name=xz-inadequate />

The LZMA2 header consists of a byte indicating the dictionary size:
* 40 indicates a 4 GB − 1 dictionary size
* Even values less than 40 indicate a 2<sup>''v''/2 + 12</sup> bytes dictionary size
* Odd values less than 40 indicate a 3×2<sup>(''v'' − 1)/2 + 11</sup> bytes dictionary size
* Values higher than 40 are invalid

LZMA2 data consists of packets starting with a control byte, with the following values:
* 0 denotes the end of the file
* 1 denotes a dictionary reset followed by an uncompressed chunk
* 2 denotes an uncompressed chunk without a dictionary reset
* 3–0x7f are invalid values
* 0x80–0xff denotes an LZMA chunk, where the lowest 5 bits are used as bit 16–20 of the uncompressed size minus one, and bit 5–6 indicates what should be reset

Bits 5–6 for LZMA chunks can be:
* 0: nothing reset
* 1: state reset
* 2: state reset, properties reset using properties byte
* 3: state reset,  properties reset using properties byte, dictionary reset

LZMA state resets cause a reset of all LZMA state except the dictionary, and specifically:
* The range coder
* The ''state'' value
* The last distances for repeated matches
* All LZMA probabilities

Uncompressed chunks consist of:
* A 16-bit big-endian value encoding the data size minus one
* The data to be copied verbatim into the dictionary and the output

LZMA chunks consist of:
* A 16-bit big-endian value encoding the low 16 bits of the uncompressed size minus one
* A 16-bit big-endian value encoding the compressed size minus one
* A properties/lclppb byte if bit 6 in the control byte is set
* The LZMA compressed data, starting with the 5 bytes (of which the first is ignored) used to initialize the range coder (which are included in the compressed size)

=== xz and 7z formats ===

The .[[XZ Utils|xz]] format, which can contain LZMA2 data, is documented at ''tukaani.org'',<ref>{{cite web
| url=http://tukaani.org/xz/xz-file-format.txt
| title=The .xz File Format
| date=2009-08-27
| access-date=2013-06-16}}</ref> while the .7z file format, which can contain either LZMA or LZMA2 data, is documented in the 7zformat.txt file contained in the LZMA SDK.<ref name="LZMA_SDK">{{cite web
| url=http://7-zip.org/sdk.html
| title=LZMA SDK (Software Development Kit)
| year=2013
| author=Igor Pavlov
| access-date=2013-06-16
}}</ref>