Editing Truncated binary encoding

{{More references|date=December 2009}}

'''Truncated binary encoding''' is an [[entropy encoding]] typically used for uniform [[probability distribution]]s with a finite alphabet. It is parameterized by an alphabet with total size of number ''n''. It is a slightly more general form of [[Binary numeral system|binary encoding]] when ''n'' is not a [[power of two]].

If ''n'' is a power of two, then the coded value for 0 ≤ ''x'' < ''n'' is the simple binary code for ''x'' of length log<sub>2</sub>(''n''). Otherwise let ''k'' = floor(log<sub>2</sub>(''n'')), such that 2<sup>''k''</sup> < ''n'' < 2<sup>''k''+1</sup>and let ''u'' = 2<sup>''k''+1</sup> − ''n''.

Truncated binary encoding assigns the first ''u'' symbols codewords of length ''k'' and then assigns the remaining ''n'' − ''u'' symbols the last ''n'' − ''u'' codewords of length ''k'' + 1. Because all the codewords of length ''k'' + 1 consist of an unassigned codeword of length ''k'' with a "0" or "1" appended, the resulting code is a [[prefix code]].

== History ==

Used since at least 1984, '''phase-in codes''', also known as '''economy codes''',<ref>Eastman, Willard L, ''et al.'' (Aug. 1984) [https://patentimages.storage.googleapis.com/86/7d/56/dd49314023387d/US4464650.pdf Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data Signals], US Patent 4,464,650.</ref><ref>Acharya, Tinku et Já Já, Joseph F. (oct. 1996), [https://www.sciencedirect.com/science/article/abs/pii/0020025596000898 An on-line variable-length binary encoding of text], Information Sciences, vol 94 no 1-4, p. 1-22.</ref><ref>Job van der Zwan. [https://observablehq.com/@jobleonard/phase-in-codes "Phase-in Codes"].</ref> are also known as truncated binary encoding.

==Example with ''n'' = 5==
For example, for the alphabet {0, 1, 2, 3, 4}, ''n'' = 5 and 2<sup>2</sup> ≤ ''n'' < 2<sup>3</sup>, hence ''k'' = 2 and ''u'' = 2<sup>3</sup> − 5 = 3. Truncated binary encoding assigns the first ''u'' symbols the codewords 00, 01, and 10, all of length 2, then assigns the last ''n'' − ''u'' symbols the codewords 110 and 111, the last two codewords of length 3.

For example, if ''n'' is 5, plain binary encoding and truncated binary encoding allocates the following [[Code word (communication)|codewords]]. Digits shown <del>struck</del> are not transmitted in truncated binary.
{| class="wikitable" style="text-align:center"
|-
! Truncated<br/>binary !!colspan=3| Encoding !! Standard<br/>binary
|-
|align=right| 0 ||bgcolor=silver| <del>0</del> || 0 || 0 ||align=left| 0
|-
|align=right| 1 ||bgcolor=silver| <del>0</del> || 0 || 1 ||align=left| 1
|-
|align=right| 2 ||bgcolor=silver| <del>0</del> || 1 || 0 ||align=left| 2
|-
|align=right| UNUSED ||bgcolor=silver| <del>0</del> ||bgcolor=silver| <del>1</del> ||bgcolor=silver| <del>1</del> ||align=left| 3
|-
|align=right| UNUSED ||bgcolor=silver| <del>1</del> ||bgcolor=silver| <del>0</del> ||bgcolor=silver| <del>0</del> ||align=left| 4
|-
|align=right| UNUSED ||bgcolor=silver| <del>1</del> ||bgcolor=silver| <del>0</del> ||bgcolor=silver| <del>1</del> ||align=left| 5/UNUSED
|-
|align=right| 3 || 1 || 1 || 0 ||align=left| 6/UNUSED
|-
|align=right| 4 || 1 || 1 || 1 ||align=left| 7/UNUSED
|}

It takes 3 bits to encode ''n'' using straightforward binary encoding, hence 2<sup>3</sup> − ''n'' = 8 − 5 = 3 are unused.

In numerical terms, to send a value ''x'', where 0 ≤ ''x'' < ''n'', and where there are 2<sup>''k''</sup> ≤ ''n'' < 2<sup>''k''+1</sup> symbols, there are ''u'' = 2<sup>''k''+1</sup> − ''n'' unused entries when the alphabet size is rounded up to the nearest power of two. The process to encode the number ''x'' in truncated binary is: if ''x'' is less than ''u'', encode it in ''k'' binary bits; if ''x'' is greater than or equal to ''u'', encode the value ''x'' + ''u'' in ''k'' + 1 binary bits.

==Example with ''n'' = 10==
Another example, encoding an alphabet of size 10 (between 0 and 9) requires 4&nbsp;bits, but there are 2<sup>4</sup> − 10 = 6 unused codes, so input values less than 6 have the first bit discarded, while input values greater than or equal to 6 are offset by 6 to the end of the binary space. (Unused patterns are not shown in this table.)

{| class="wikitable" style="text-align:center"
|-
! Input<br/>value !! Offset !! Offset<br/>value !! Standard<br/>binary || Truncated<br />binary
|-
| 0 || 0 ||  0 || <del>0</del>000 || 000 
|-
| 1 || 0 ||  1 || <del>0</del>001 || 001 
|-
| 2 || 0 ||  2 || <del>0</del>010 || 010 
|-
| 3 || 0 ||  3 || <del>0</del>011 || 011 
|-
| 4 || 0 ||  4 || <del>0</del>100 || 100 
|-
| 5 || 0 ||  5 || <del>0</del>101 || 101 
|-
|colspan=5|
|-
| 6 || 6 || 12 || 0110 || 1100
|-
| 7 || 6 || 13 || 0111 || 1101
|-
| 8 || 6 || 14 || 1000 || 1110
|-
| 9 || 6 || 15 || 1001 || 1111
|}

To decode, read the first ''k'' bits. If they encode a value less than ''u'', decoding is complete. Otherwise, read an additional bit and subtract ''u'' from the result.

==Example with ''n'' = 7==
Here is a more extreme case: with ''n'' = 7 the next power of 2 is 8, so ''k'' = 2 and ''u'' = 2<sup>3</sup> − 7 = 1:
{| class="wikitable" style="text-align:center"
|-
! Input<br/>value !! Offset !! Offset<br />value !! Standard<br/>binary || Truncated<br />binary
|-
| 0 || 0 || 0 || <del>0</del>00 || 00 
|-
|colspan=5|
|-
| 1 || 1 || 2 || 001 || 010
|-
| 2 || 1 || 3 || 010 || 011
|-
| 3 || 1 || 4 || 011 || 100
|-
| 4 || 1 || 5 || 100 || 101
|-
| 5 || 1 || 6 || 101 || 110
|-
| 6 || 1 || 7 || 110 || 111
|}

This last example demonstrates that a leading zero bit does not always indicate a short code; if ''u'' < 2<sup>''k''</sup>, some long codes will begin with a zero bit.

== Simple algorithm ==

Generate the truncated binary encoding for a value ''x'', 0 ≤ ''x'' < ''n'', where ''n'' > 0 is the size of the alphabet containing ''x''. ''n'' need not be a power of two.  
<syntaxhighlight lang="C">
string TruncatedBinary (int x, int n)
{
	// Set k = floor(log2(n)), i.e., k such that 2^k <= n < 2^(k+1).
	int k = 0, t = n;
	while (t > 1) { k++;  t >>= 1; }

	// Set u to the number of unused codewords = 2^(k+1) - n.
	int u = (1 << k + 1) - n;

	if (x < u)
        return Binary(x, k); 
    else
        return Binary(x + u, k + 1));
}
</syntaxhighlight>
The routine <code>Binary</code> is expository; usually just the rightmost <code>len</code> bits of the variable ''x'' are desired.
Here we simply output the binary code for ''x'' using <code>len</code> bits, padding with high-order 0s if necessary.
<syntaxhighlight lang="C">
string Binary (int x, int len)
{
	string s = "";
	while (x != 0) {
		if (even(x))
            s = '0' + s;
		else  s = '1' + s;
		
		x >>= 1;
	}
	while (s.Length < len)
        s = '0' + s;
	return s;
}
</syntaxhighlight>

== On efficiency ==
If ''n'' is not a power of two, and ''k''-bit symbols are observed with probability ''p'', then (''k'' + 1)-bit symbols are observed with probability 1 − ''p''. We can calculate the expected number of bits per symbol <math>b_e</math> as

: <math>b_e = p k + (1 - p) (k + 1).</math>

Raw encoding of the symbol has <math>b_u = k + 1</math> bits. Then relative space saving ''s'' (see [[Data compression ratio]]) of the encoding can be defined as

: <math>s = 1 - \frac{b_e}{b_u} = 1 - \frac{p k + (1 - p) (k + 1)}{k + 1}.</math>

When simplified, this expression leads to

: <math>s = \frac{p}{k + 1} = \frac{p}{b_u}.</math>

This indicates that relative efficiency of truncated binary encoding increases as probability ''p'' of ''k''-bit symbols increases, and the raw-encoding symbol bit-length <math>b_u</math> decreases.

==See also==
* [[Benford's law]]
* [[Golomb coding]]

==References==
{{Reflist}}

{{DEFAULTSORT:Truncated Binary Encoding}}
[[Category:Entropy coding]]
[[Category:Lossless compression algorithms]]