==Implementation details and examples==
[[File:arithmetic_coding_visualisation.svg|thumb|300px|Encoding the message "WIKI" with arithmetic coding
{|
! style="text-align:right; vertical-align:top;"|1.
| The letter frequencies are found.
|-
! style="text-align:right; vertical-align:top;"|2.
| The interval [0, 1) is partitioned in the ratio of the frequencies.
|-
! style="text-align:right; vertical-align:top;"|3–5.
| The corresponding interval is iteratively partitioned for each letter in the message.
|-
! style="text-align:right; vertical-align:top;"|6.
| Any value in the final interval is chosen to represent the message.
|-
! style="text-align:right; vertical-align:top;"|2*–6*.
| The partitioning and value if the message were "KIWI" instead.
|}
]]
[[File:Arithmetic_coding_visualisation_circle.svg|thumb|300px|The above example visualised as a circle, the values in red encoding "WIKI" and "KIWI" – in [http://upload.wikimedia.org/wikipedia/commons/8/81/Arithmetic_coding_visualisation_circle.svg the SVG image], hover over an interval to highlight it and show its statistics]]

===Equal probabilities===
In the simplest case, the probability of each symbol occurring is equal. For example, consider a set of three symbols, A, B, and C, each equally likely to occur. Encoding the symbols one by one would require 2 bits per symbol, which is wasteful: one of the bit variations is never used. That is to say, symbols A, B and C might be encoded respectively as 00, 01 and 10, with 11 unused.

A more efficient solution is to represent a sequence of these three symbols as a rational number in base 3 where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201<sub>3</sub>, interpreted in arithmetic coding as a value in the interval [0, 1). The next step is to encode this [[Ternary numeral system|ternary]] number using a fixed-point binary number of sufficient precision to recover it, such as 0.0010110010<sub>2</sub> – this is only 10 bits; 2 bits are saved in comparison with naïve block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers.

To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.

===Defining a model===
In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities. (The optimal value is −log<sub>2</sub>''P'' bits for each symbol of probability ''P''; see ''[[Source coding theorem]]''.) Compression algorithms that use arithmetic coding start by determining a [[model (abstract)|model]] of the data – basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be.

'''Example''': a simple, static model for describing the output of a particular monitoring instrument over time might be:
*60% chance of symbol NEUTRAL
*20% chance of symbol POSITIVE
*10% chance of symbol NEGATIVE
*10% chance of symbol END-OF-DATA. ''(The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)''

Models can also handle alphabets other than the simple four-symbol set chosen for this example.
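In code, such a static model is often stored in cumulative form: each symbol owns a sub-interval of [0, 1) whose width is its probability. The following is a minimal Python sketch of one possible representation (an illustration only, not taken from any particular codec):

<syntaxhighlight lang="python">
from fractions import Fraction  # exact arithmetic avoids rounding drift

# The static four-symbol model above, stored as the sub-interval of [0, 1)
# that each symbol occupies; the interval widths are the probabilities.
MODEL = {
    "NEUTRAL":     (Fraction(0),     Fraction(6, 10)),  # 60%
    "POSITIVE":    (Fraction(6, 10), Fraction(8, 10)),  # 20%
    "NEGATIVE":    (Fraction(8, 10), Fraction(9, 10)),  # 10%
    "END-OF-DATA": (Fraction(9, 10), Fraction(1)),      # 10%
}
</syntaxhighlight>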
More sophisticated models are also possible: ''higher-order'' modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the ''context''), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be ''[[adaptive coding|adaptive]]'', so that they continually change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.

===Encoding and decoding: overview===
In general, each step of the encoding process, except for the last, is the same; the encoder has basically just three pieces of data to consider:
* The next symbol that needs to be encoded
* The current [[interval (mathematics)|interval]] (at the very start of the encoding process, the interval is set to <nowiki>[0, 1)</nowiki>, but that will change)
* The probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step)

The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step.

'''Example''': for the four-symbol model above:
* the interval for NEUTRAL would be <nowiki>[0, 0.6)</nowiki>
* the interval for POSITIVE would be <nowiki>[0.6, 0.8)</nowiki>
* the interval for NEGATIVE would be <nowiki>[0.8, 0.9)</nowiki>
* the interval for END-OF-DATA would be <nowiki>[0.9, 1)</nowiki>

When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the final interval and the model that was used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval.

It is not necessary to transmit the final interval, however; it is only necessary to transmit ''one fraction'' that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval; this will guarantee that the resulting code is a [[prefix code]].
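The procedure just described fits in a few lines of code. The sketch below is illustrative only (it reuses the model table from the earlier sketch; real arithmetic coders operate on finite-precision integers with renormalization rather than exact fractions): it narrows the interval once per symbol, then reverses the process to decode any fraction picked from the final interval.

<syntaxhighlight lang="python">
from fractions import Fraction

# The static model from the earlier sketch: symbol -> sub-interval of [0, 1).
MODEL = {
    "NEUTRAL":     (Fraction(0),     Fraction(6, 10)),
    "POSITIVE":    (Fraction(6, 10), Fraction(8, 10)),
    "NEGATIVE":    (Fraction(8, 10), Fraction(9, 10)),
    "END-OF-DATA": (Fraction(9, 10), Fraction(1)),
}

def encode(symbols):
    """Narrow [low, high) once per symbol; return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        width = high - low
        s_low, s_high = MODEL[s]
        # The symbol's sub-interval, rescaled into the current interval.
        low, high = low + width * s_low, low + width * s_high
    return low, high

def decode(value):
    """Recover the symbols from any fraction inside the final interval."""
    symbols = []
    low, high = Fraction(0), Fraction(1)
    while not symbols or symbols[-1] != "END-OF-DATA":
        width = high - low
        point = (value - low) / width      # position of value within [0, 1)
        for s, (s_low, s_high) in MODEL.items():
            if s_low <= point < s_high:    # the sub-interval containing value
                symbols.append(s)
                low, high = low + width * s_low, low + width * s_high
                break
    return symbols

low, high = encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"])
print(float(low), float(high))   # 0.534 0.54
print(decode((low + high) / 2))  # any fraction inside the interval decodes correctly
</syntaxhighlight>

Running it on the sequence NEUTRAL, NEGATIVE, END-OF-DATA yields the interval [0.534, 0.54), which reappears in the worked example below.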
===Encoding and decoding: example===
[[Image:Arithmetic encoding.svg|400px|thumb|right|A diagram showing decoding of 0.538 (the round dot) in the example model. The region is divided into subregions proportional to symbol frequencies, then the subregion containing the point is successively subdivided in the same way.]]
Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary; also assuming that there are only as many digits as needed to decode the message). The process starts with the same interval used by the encoder, <nowiki>[0, 1)</nowiki>, and, using the same model, divides it into the same four sub-intervals that the encoder must have. The fraction 0.538 falls into the sub-interval for NEUTRAL, <nowiki>[0, 0.6)</nowiki>; this indicates that the first symbol the encoder read must have been NEUTRAL, so this is the first symbol of the message.

Next divide the interval <nowiki>[0, 0.6)</nowiki> into sub-intervals:
* the interval for NEUTRAL would be <nowiki>[0, 0.36)</nowiki>, ''60% of <nowiki>[0, 0.6)</nowiki>''
* the interval for POSITIVE would be <nowiki>[0.36, 0.48)</nowiki>, ''20% of <nowiki>[0, 0.6)</nowiki>''
* the interval for NEGATIVE would be <nowiki>[0.48, 0.54)</nowiki>, ''10% of <nowiki>[0, 0.6)</nowiki>''
* the interval for END-OF-DATA would be <nowiki>[0.54, 0.6)</nowiki>, ''10% of <nowiki>[0, 0.6)</nowiki>''

Since 0.538 is within the interval <nowiki>[0.48, 0.54)</nowiki>, the second symbol of the message must have been NEGATIVE.

Again divide the current interval into sub-intervals:
* the interval for NEUTRAL would be <nowiki>[0.48, 0.516)</nowiki>
* the interval for POSITIVE would be <nowiki>[0.516, 0.528)</nowiki>
* the interval for NEGATIVE would be <nowiki>[0.528, 0.534)</nowiki>
* the interval for END-OF-DATA would be <nowiki>[0.534, 0.540)</nowiki>

Now 0.538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.

===Sources of inefficiency===
The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is <math>3 \times \log_2(10) \approx 9.966</math> [[bit]]s; the same message could have been encoded in the binary fraction 0.10001001 (equivalent to 0.53515625 decimal) at a cost of only 8 bits.

This 8-bit output is larger than the information content, or [[information entropy|entropy]], of the message, which is
:<math qid=Q2651>\sum -\log_2(p_i) = -\log_2(0.6) - \log_2(0.1) - \log_2(0.1) = 7.381 \text{ bits}.</math>
But an integer number of bits must be used in the binary encoding, so an encoder for this message would use at least 8 bits, resulting in a message 8.4% larger than its entropy. This inefficiency of at most 1 bit becomes proportionally less significant as the message size grows.

Moreover, the claimed symbol probabilities were (0.6, 0.2, 0.1, 0.1), but the actual frequencies in this example are (0.33, 0, 0.33, 0.33). If the intervals are readjusted for these frequencies, the entropy of the message would be 4.755 bits and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals <nowiki>[0, 1/3); [1/9, 2/9); [5/27, 6/27);</nowiki> and a binary interval of <nowiki>[0.00101111011, 0.00111000111)</nowiki>. This is also an example of how statistical coding methods like arithmetic coding can produce an output message that is larger than the input message, especially if the probability model is off.
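The figures in this subsection can be checked directly; the following short Python sketch (illustrative only) reproduces the arithmetic:

<syntaxhighlight lang="python">
from math import log2

print(3 * log2(10))                        # 9.965...: bits in three decimal digits
print(-log2(0.6) - log2(0.1) - log2(0.1))  # 7.380...: entropy under the claimed model
print(0b10001001 / 2**8)                   # 0.53515625: value of the binary fraction 0.10001001
print(3 * -log2(1 / 3))                    # 4.754...: entropy under the actual frequencies
</syntaxhighlight>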