==Arithmetic coding as a generalized change of radix==
Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a ''generalized'' change of [[radix]]. For example, we may look at any sequence of symbols:

:<math>\mathrm{DABDDB}</math>

as a number in a certain base, presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:

{| class="wikitable" style="text-align:center;"
!Symbol
!Frequency of occurrence
!Cumulative frequency
|-
|A
|1
|0
|-
|B
|2
|1
|-
|D
|3
|3
|}

The ''cumulative frequency'' for an item is the sum of the frequencies of all items preceding it; in other words, it is a running total of frequencies.

In a positional [[numeral system]] the radix, or base, is numerically equal to the number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Any finite integer can then be expressed in polynomial form, with powers of the radix as the presumed multipliers. For example, the number 457 is actually 4×10<sup>2</sup> + 5×10<sup>1</sup> + 7×10<sup>0</sup>, where base 10 is presumed but not shown explicitly.

Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:

:<math>6^5 \times 3 + 6^4 \times 0 + 6^3 \times 1 + 6^2 \times 3 + 6^1 \times 3 + 6^0 \times 1 = 23671</math>

The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the [[Entropy (information theory)|entropy]] of the message), approximately 9 bits.

To encode a message with a length closer to the theoretical limit imposed by information theory, we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds ''L'' and ''U'' and choose a number between them. For the computation of ''L'', we multiply each term in the above expression by the product of the frequencies of all previously occurring symbols:

:<math>\begin{align}
L = {} & (6^5 \times 3) + {}\\
& 3 \times (6^4 \times 0) + {}\\
& (3 \times 1) \times (6^3 \times 1) + {}\\
& (3 \times 1 \times 2) \times (6^2 \times 3) + {}\\
& (3 \times 1 \times 2 \times 3) \times (6^1 \times 3) + {}\\
& (3 \times 1 \times 2 \times 3 \times 3) \times (6^0 \times 1) {}\\
= {} & 25002
\end{align}</math>

More generally, ''L'' may be computed as:

:<math> L = \sum_{i=1}^n n^{n-i} C_i \prod_{k=1}^{i-1} f_k </math>

where <math>\scriptstyle C_i</math> are the cumulative frequencies and <math>\scriptstyle f_k</math> are the frequencies of occurrence, the index denoting the position of the symbol in the message. In the special case where all frequencies <math>\scriptstyle f_k</math> are 1, this is the change-of-base formula.

The upper bound ''U'' is ''L'' plus the product of all the frequencies; in this case ''U'' = ''L'' + (3 × 1 × 2 × 3 × 3 × 2) = 25002 + 108 = 25110. In general, ''U'' is given by:

:<math> U = L + \prod_{k=1}^{n} f_k </math>
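The computation of ''L'' and ''U'' can be written out directly. The following minimal Python sketch (illustrative only, not part of the standard presentation; the <code>value</code> and <code>freq</code> mappings are assumptions encoding the DABDDB example above) implements the two formulas:

<syntaxhighlight lang="python">
# A minimal sketch (for illustration only) of the generalized
# change-of-radix encoder, using the DABDDB example from the text.
# `value` assigns digits A=0, B=1, D=3; `freq` gives occurrence counts.

def encode_bounds(message, value, freq):
    """Return the interval [L, U) representing `message`."""
    n = len(message)
    # Cumulative frequency: sum of frequencies of all preceding symbols.
    cum = {s: sum(freq[t] for t in freq if value[t] < value[s]) for s in freq}
    lower = 0
    running = 1  # product of frequencies of previously occurring symbols
    for i, s in enumerate(message):
        lower += n ** (n - 1 - i) * cum[s] * running
        running *= freq[s]
    return lower, lower + running  # U = L + product of all frequencies

value = {'A': 0, 'B': 1, 'D': 3}
freq = {'A': 1, 'B': 2, 'D': 3}
print(encode_bounds('DABDDB', value, freq))  # (25002, 25110)
</syntaxhighlight>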
Now we can choose any number from the interval [''L'', ''U'') to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251×10<sup>2</sup>. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes.

To decode the integer 25100, the polynomial computation can be reversed as shown in the table below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.

{| class="wikitable" style="text-align:center;"
!Remainder
!Identification
!Identified symbol
!Corrected remainder
|-
|25100
|style="text-align:right;"|25100 / 6<sup>5</sup> = 3
|D
|style="text-align:right;"|(25100 − 6<sup>5</sup> × 3) / 3 = 590
|-
|590
|style="text-align:right;"|590 / 6<sup>4</sup> = 0
|A
|style="text-align:right;"|(590 − 6<sup>4</sup> × 0) / 1 = 590
|-
|590
|style="text-align:right;"|590 / 6<sup>3</sup> = 2
|B
|style="text-align:right;"|(590 − 6<sup>3</sup> × 1) / 2 = 187
|-
|187
|style="text-align:right;"|187 / 6<sup>2</sup> = 5
|D
|style="text-align:right;"|(187 − 6<sup>2</sup> × 3) / 3 = 26{{figure space}}
|-
|26
|style="text-align:right;"|26 / 6<sup>1</sup> = 4
|D
|style="text-align:right;"|(26 − 6<sup>1</sup> × 3) / 3 = 2{{figure space}}{{figure space}}
|-
|2
|style="text-align:right;"|2 / 6<sup>0</sup> = 2
|B
|−
|}

During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals, and the appropriate symbol is selected from a lookup table. Once the symbol is identified, the remainder is corrected. The process continues for the known length of the message, or while the remaining result is positive. The only difference compared to the classical change of base is that a range of values may be associated with each symbol: in this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This accords exactly with the intervals determined by the frequencies. When all intervals are equal to 1 we have the special case of the classic base change.
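The decoding loop just described can be sketched in the same illustrative style, under the same assumed <code>value</code> and <code>freq</code> mappings as the encoder sketch above:

<syntaxhighlight lang="python">
# A minimal sketch (for illustration only) of the decoding loop
# described above: floor-divide, match against cumulative intervals,
# then correct the remainder.

def decode(number, n, value, freq):
    """Recover an n-symbol message from a number in [L, U)."""
    cum = {s: sum(freq[t] for t in freq if value[t] < value[s]) for s in freq}
    message = []
    for i in range(n):
        power = n ** (n - 1 - i)
        digit = number // power  # floor after dividing by the power of 6
        # The symbol whose cumulative interval contains the digit.
        symbol = max((s for s in freq if cum[s] <= digit), key=cum.get)
        message.append(symbol)
        number = (number - power * cum[symbol]) // freq[symbol]  # correction
    return ''.join(message)

value = {'A': 0, 'B': 1, 'D': 3}
freq = {'A': 1, 'B': 2, 'D': 3}
print(decode(25100, 6, value, freq))  # DABDDB
</syntaxhighlight>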
===Theoretical limit of compressed message===
The lower bound ''L'' never exceeds ''n''<sup>''n''</sup>, where ''n'' is the size of the message, and so can be represented in <math>\log_2(n^n) = n \log_2(n)</math> bits. After the computation of the upper bound ''U'' and the reduction of the message by selecting a number from the interval [''L'', ''U'') with the longest trail of zeros, we can presume that this length can be reduced by <math>\textstyle \log_2\left(\prod_{k=1}^n f_k\right)</math> bits. Since each frequency in the product occurs exactly as many times as its own value, we can use the size of the alphabet ''A'' to compute the product:

:<math> \prod_{k=1}^n f_k = \prod_{k=1}^A f_k^{f_k}.</math>

Applying log<sub>2</sub> to the estimated number of bits in the message, the final message (not counting a logarithmic overhead for the message length and frequency tables) will match the number of bits given by [[entropy (information theory)|entropy]], which for long messages is very close to optimal:

:<math>n \log_2(n) - \sum_{i=1}^A f_i \log_2(f_i) = -n \sum_{i=1}^A \frac{f_i}{n} \log_2\!\left(\frac{f_i}{n}\right) = nH</math>

For the example message, this gives <math>6\log_2 6 - \log_2 108 \approx 8.75</math> bits, matching the entropy estimate of approximately 9 bits. In other words, the efficiency of arithmetic encoding approaches the theoretical limit of <math>H</math> bits per symbol as the message length approaches infinity.

==== Asymptotic equipartition ====
We can understand this intuitively. Suppose the source is ergodic; then it has the [[asymptotic equipartition property]] (AEP). By the AEP, after a long stream of <math>n</math> symbols, the interval <math>(0, 1)</math> is almost partitioned into almost equally sized intervals.

Technically, for any small <math>\epsilon > 0</math>, for all large enough <math>n</math>, there exist <math>2^{nH(X)(1+O(\epsilon))}</math> strings <math>x_{1:n}</math> such that each string has almost equal probability <math>\Pr(x_{1:n}) = 2^{-nH(X)(1+ O(\epsilon))}</math>, and their total probability is <math>1-O(\epsilon)</math>.

Any such string is arithmetically encoded by a binary string of length <math>k</math>, where <math>k</math> is the smallest integer such that a fraction of the form <math>\tfrac{c}{2^k}</math>, with integer <math>c</math>, lies in the interval for <math>x_{1:n}</math>. Since the interval for <math>x_{1:n}</math> has size <math>2^{-nH(X)(1+ O(\epsilon))}</math>, we should expect it to contain one fraction of the form <math>\tfrac{c}{2^k}</math> when <math>k = nH(X)(1+O(\epsilon))</math>. Thus, with high probability, <math>x_{1:n}</math> can be arithmetically encoded with a binary string of length <math>nH(X)(1 + O(\epsilon))</math>.
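As a small numerical illustration of this last step (a sketch under assumed interval endpoints, not part of the standard treatment): the smallest <math>k</math> for which some dyadic fraction <math>\tfrac{c}{2^k}</math> falls in a half-open interval of width <math>w</math> is at most <math>\lceil -\log_2 w \rceil</math>, so an interval of width <math>2^{-nH(X)}</math> yields a code of about <math>nH(X)</math> bits:

<syntaxhighlight lang="python">
# A small illustration (not from the article) of the code-length claim:
# the smallest k such that some dyadic fraction c/2^k lies in a given
# half-open interval [low, high) is at most ceil(-log2(high - low)).

from fractions import Fraction
from math import ceil

def smallest_dyadic_k(low, high):
    """Smallest k such that low <= c / 2**k < high for some integer c."""
    k = 0
    while ceil(low * 2 ** k) >= high * 2 ** k:  # no suitable integer c yet
        k += 1
    return k

low = Fraction(3001, 10000)   # an arbitrary interval position
width = Fraction(1, 2 ** 9)   # width 2^-9, as if nH(X) = 9
print(smallest_dyadic_k(low, low + width))  # 8 here; never more than 9
</syntaxhighlight>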