== Problem definition ==
{{More citations needed|date=December 2021}}
[[File:HuffmanCodeAlg.png|thumb|Constructing a Huffman tree]]

=== Informal description ===
;Given: A set of symbols <math>S</math> and for each symbol <math>x \in S</math>, the frequency <math>f_x</math> representing the fraction of symbols in the text that are equal to <math>x</math>.<ref name="kleinberg&Tardos">{{cite book |last1=Kleinberg |first1=Jon |last2=Tardos |first2=Eva |title=Algorithm Design |date=March 16, 2005 |publisher=[[Pearson Education]] |isbn=9780321295354 |page=165 |edition=1 |url=https://www.pearson.com/en-us/subject-catalog/p/algorithm-design/P200000003259/9780137546350 |access-date=26 January 2025}}</ref>
;Find: A [[Prefix code|prefix-free binary code]] (a set of codewords) with minimum [[Expected value|expected]] codeword length (equivalently, a tree with minimum [[weighted path length from the root]]).

=== Formalized description ===
'''Input'''.<br />
Alphabet <math>A = (a_{1},a_{2},\dots,a_{n})</math>, which is the symbol alphabet of size <math>n</math>.<br />
Tuple <math>W = (w_{1},w_{2},\dots,w_{n})</math>, which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. <math>w_{i} = \operatorname{weight}\left(a_{i}\right),\, i \in \{1, 2, \dots, n\}</math>.

'''Output'''.<br />
Code <math>C\left(W\right) = (c_{1},c_{2},\dots,c_{n})</math>, which is the tuple of (binary) codewords, where <math>c_{i}</math> is the codeword for <math>a_{i},\, i \in \{1, 2, \dots, n\}</math>.

'''Goal'''.<br />
Let <math display="inline">L(C(W)) = \sum_{i=1}^n w_i\operatorname{length}(c_i)</math> be the weighted path length of code <math>C</math>. Condition: <math>L(C(W)) \leq L(T(W))</math> for any code <math>T(W)</math>.

=== Example ===
We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes ''L'' over all codes, but we will compute ''L'' and compare it to the [[Shannon entropy]] ''H'' of the given set of weights; the result is nearly optimal.
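The codeword lengths shown in the table below can be reproduced with a short sketch of the standard greedy construction, which repeatedly merges the two least-weighted subtrees. This sketch is illustrative and not part of the formal description above; the function and variable names are arbitrary, and the exact 0/1 labelling of branches (and therefore the codewords themselves) may differ between equally optimal Huffman codes, although the codeword lengths agree with the table.

<syntaxhighlight lang="python">
import heapq

# Example weights (A, W) from the table below.
weights = {'a': 0.10, 'b': 0.15, 'c': 0.30, 'd': 0.16, 'e': 0.29}

def huffman_code_lengths(weights):
    """Return each symbol's codeword length under Huffman's greedy merging.

    Each heap entry is (subtree weight, tie-breaker, symbols in the subtree).
    Merging two subtrees makes every codeword inside them one bit longer.
    """
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1          # one level deeper in the tree
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths

lengths = huffman_code_lengths(weights)
print(lengths)  # {'a': 3, 'b': 3, 'c': 2, 'd': 2, 'e': 2}
</syntaxhighlight>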
{|class="wikitable" !rowspan="2" style="background:#efefef"| Input (''A'', ''W'') !style="background:#efefef;font-weight:normal"| Symbol ({{math|''a''<sub>''i''</sub>}}) |align="center" style="background:#efefef"| a |align="center" style="background:#efefef"| b |align="center" style="background:#efefef"| c |align="center" style="background:#efefef"| d |align="center" style="background:#efefef"| e !style="background:#efefef"| Sum |- !style="background:#efefef;font-weight:normal"| Weights ({{math|''w''<sub>''i''</sub>}}) |align="center"| 0.10 |align="center"| 0.15 |align="center"| 0.30 |align="center"| 0.16 |align="center"| 0.29 |align="center"| = 1 |- !rowspan="3" style="background:#efefef"| Output ''C'' !style="background:#efefef;font-weight:normal"| Codewords ({{math|''c''<sub>''i''</sub>}}) |align="center"| <code>010</code> |align="center"| <code>011</code> |align="center"| <code>11</code> |align="center"| <code>00</code> |align="center"| <code>10</code> |rowspan="2"| |- !style="background:#efefef;font-weight:normal"| Codeword length (in bits)<br />({{math|''β''<sub>''i''</sub>}}) |align="center"| 3 |align="center"| 3 |align="center"| 2 |align="center"| 2 |align="center"| 2 |- !style="background:#efefef;font-weight:normal"| Contribution to weighted path length<br />({{math|''β''<sub>''i''</sub>}} {{math|''w''<sub>''i''</sub>}} ) |align="center"| 0.30 |align="center"| 0.45 |align="center"| 0.60 |align="center"| 0.32 |align="center"| 0.58 |align="center"| ''L''(''C'') = 2.25 |- !rowspan="3" style="background:#efefef"| Optimality !style="background:#efefef;font-weight:normal"| Probability budget<br />({{math|2<sup>β''β''<sub>''i''</sub></sup>}}) | align="center" | 1/8 | align="center" | 1/8 | align="center" | 1/4 | align="center" | 1/4 | align="center" | 1/4 | align="center" | = 1.00 |- ! style="background: #efefef; font-weight: normal;" | Information content (in bits)<br />({{math|βlog<sub>2</sub> ''w''<sub>''i''</sub>}}) β |align="center"| 3.32 |align="center"| 2.74 |align="center"| 1.74 |align="center"| 2.64 |align="center"| 1.79 |align="center"| |- ! style="background: #efefef; font-weight: normal;" | Contribution to entropy<br />({{math|β''w''<sub>''i''</sub> log<sub>2</sub> ''w''<sub>''i''</sub>}}) |align="center"| 0.332 |align="center"| 0.411 |align="center"| 0.521 |align="center"| 0.423 |align="center"| 0.518 |align="center"| ''H''(''A'') = 2.205 |} For any code that is ''biunique'', meaning that the code is [[Variable-length code#Uniquely decodable codes|''uniquely decodeable'']], the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a ''complete'' code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it ''biunique''. As defined by [[A Mathematical Theory of Communication|Shannon (1948)]], the information content ''h'' (in bits) of each symbol ''a''<sub>i</sub> with non-null probability is :<math>h(a_i) = \log_2{1 \over w_i}. </math> The [[information entropy|entropy]] ''H'' (in bits) is the weighted sum, across all symbols {{math|''a''<sub>''i''</sub>}} with non-zero probability {{math|''w''<sub>''i''</sub>}}, of the information content of each symbol: :<math display="block"> H(A) = \sum_{w_i > 0} w_i h(a_i) = \sum_{w_i > 0} w_i \log_2 {1 \over w_i} = - \sum_{w_i > 0} w_i \log_2 w_i. 
</math>
(Note: A symbol with zero probability has zero contribution to the entropy, since <math>\lim_{w \to 0^+} w \log_2 w = 0</math>. So for simplicity, symbols with zero probability can be left out of the formula above.)

As a consequence of [[Shannon's source coding theorem]], the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon.

In general, a Huffman code need not be unique. Thus the set of Huffman codes for a given probability distribution is a non-empty subset of the codes minimizing <math>L(C)</math> for that probability distribution. (However, for each minimizing codeword length assignment, there exists at least one Huffman code with those lengths.)
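The quantities reported in the table (the weighted path length ''L''(''C''), the probability budget, and the entropy ''H''(''A'')) can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the weights and codeword lengths are copied from the table above, and the variable names are illustrative.

<syntaxhighlight lang="python">
import math

# Weights w_i and Huffman codeword lengths from the example table.
weights = {'a': 0.10, 'b': 0.15, 'c': 0.30, 'd': 0.16, 'e': 0.29}
lengths = {'a': 3, 'b': 3, 'c': 2, 'd': 2, 'e': 2}

# Weighted path length: L(C) = sum of w_i * length(c_i).
L = sum(weights[s] * lengths[s] for s in weights)

# Probability budget (Kraft sum): sum of 2^(-length_i), equal to 1 for a complete code.
kraft = sum(2.0 ** -lengths[s] for s in weights)

# Shannon entropy: H(A) = -sum of w_i * log2(w_i) over symbols with w_i > 0.
H = -sum(w * math.log2(w) for w in weights.values() if w > 0)

print(round(L, 3), kraft, round(H, 3))  # 2.25 1.0 2.205
</syntaxhighlight>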