== Basic technique ==

=== Compression ===
[[File:Huffman_coding_visualisation.svg|thumb|upright=1.5|Visualisation of the use of Huffman coding to encode the message "A_DEAD_DAD_{{zwsp}}CEDED_A_BAD_{{zwsp}}BABE_A_BEADED_{{zwsp}}ABACA_BED". In steps 2 to 6, the letters are sorted by increasing frequency, the two least frequent letters at each step are combined and reinserted into the list, and a partial tree is constructed. The final tree in step 6 is traversed to generate the dictionary in step 7. Step 8 uses it to encode the message.]]
[[Image:Huffman coding example.svg|thumb|A source generates 4 different symbols <math>\{a_1 , a_2 , a_3 , a_4 \}</math> with probability <math>\{0.4 ; 0.35 ; 0.2 ; 0.05 \}</math>. A binary tree is generated from left to right by taking the two least probable symbols and combining them into a new equivalent symbol whose probability equals the sum of the two. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:
{| class="wikitable"
! Symbol !! Code
|-
| a1 || 0
|-
| a2 || 10
|-
| a3 || 110
|-
| a4 || 111
|}
The standard way to represent a signal made of 4 symbols is to use 2 bits/symbol, but the [[Information entropy|entropy]] of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, the average length is lowered to 1.85 bits/symbol; it is still above the theoretical limit because the probabilities of the symbols differ from negative powers of two.]]

The technique works by creating a [[binary tree]] of nodes. These can be stored in a regular [[Array data type|array]], the size of which depends on the number of symbols, <math>n</math>. A node can be either a [[leaf node]] or an [[internal node]]. Initially, all nodes are leaf nodes, which contain the '''symbol''' itself, the '''weight''' (frequency of appearance) of the symbol and, optionally, a link to a '''parent''' node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a '''weight''', links to '''two child nodes''' and an optional link to a '''parent''' node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to <math>n</math> leaf nodes and <math>n-1</math> internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.

The process begins with the leaf nodes containing the probabilities of the symbols they represent. The two nodes with the smallest probability are then taken, and a new internal node is created with these two nodes as children; the weight of the new node is set to the sum of the weights of its children. The process is then applied again to the new internal node and the remaining nodes (i.e., excluding the two leaf nodes just merged), and repeated until only one node remains, which is the root of the Huffman tree.

The simplest construction algorithm uses a [[priority queue]] in which the node with the lowest probability is given the highest priority (a short sketch in code follows the list):

# Create a leaf node for each symbol and add it to the priority queue.
# While there is more than one node in the queue:
## Remove the two nodes of highest priority (lowest probability) from the queue.
## Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
## Add the new node to the queue.
# The remaining node is the root node and the tree is complete.
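The following is a minimal illustrative sketch in Python of the priority-queue construction just described (not a canonical implementation; it assumes the symbol weights are given as a dictionary and uses an ad hoc nested-tuple representation for the tree, with a leaf stored as a 1-tuple and an internal node as a pair of children):

<syntaxhighlight lang="python">
import heapq
import itertools

def huffman_code(weights):
    """Build a Huffman code from a {symbol: weight} mapping.

    Returns a {symbol: bit string} dictionary. A leaf is stored as a
    1-tuple (symbol,); an internal node as a pair (left, right).
    """
    # Each heap entry is (weight, tie-breaker, node); the unique tie-breaker
    # keeps comparisons well defined when two weights are equal.
    counter = itertools.count()
    heap = [(w, next(counter), (sym,)) for sym, w in weights.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Remove the two nodes of lowest weight ...
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # ... and combine them into a new internal node.
        heapq.heappush(heap, (w1 + w2, next(counter), (left, right)))

    # Traverse the finished tree: '0' for the left child, '1' for the right.
    codes = {}
    def assign(node, prefix):
        if len(node) == 1:                   # leaf node
            codes[node[0]] = prefix or "0"   # a single-symbol alphabet still needs one bit
        else:                                # internal node
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
    assign(heap[0][2], "")
    return codes

# The four-symbol source from the figure above yields code lengths 1, 2, 3, 3
# (1.85 bits/symbol on average); the exact bit assignments depend on how ties
# and child order are resolved.
print(huffman_code({"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}))
# e.g. {'a1': '0', 'a4': '100', 'a3': '101', 'a2': '11'}
</syntaxhighlight>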
Since efficient priority queue data structures require O(log ''n'') time per insertion, and a tree with ''n'' leaves has 2''n'' − 1 nodes, this algorithm operates in O(''n'' log ''n'') time, where ''n'' is the number of symbols.

If the symbols are sorted by probability, there is a [[linear-time]] (O(''n'')) method to create a Huffman tree using two [[Queue (data structure)|queues]], the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:

# Start with as many leaves as there are symbols.
# Enqueue all leaf nodes into the first queue (by probability in increasing order, so that the least likely item is at the head of the queue).
# While there is more than one node in the queues:
## Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
## Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
## Enqueue the new node into the rear of the second queue.
# The remaining node is the root node; the tree has now been generated.

Once the Huffman tree has been generated, it is traversed to generate a dictionary which maps the symbols to binary codes as follows:

# Start with the current node set to the root.
# If the node is not a leaf node, label the edge to the left child as 0 and the edge to the right child as 1. Repeat the process at both the left child and the right child.

The final encoding of any symbol is then read as the concatenation of the labels on the edges along the path from the root node to the symbol.

In many cases, time complexity is not very important in the choice of algorithm here, since ''n'' is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded), whereas complexity analysis concerns the behavior when ''n'' grows to be very large.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification retains the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.

<!-- Here's an example of optimized Huffman coding using the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs", where the symbols are those that occur in the string, and their weight is their number of occurrences. Note that the original Huffman coding tree structure would be different from the given example: [[Image:Huffman huff demo.gif|center]] -->

=== Decompression ===
Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must somehow be reconstructed.
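The bit-by-bit traversal just described can be sketched as follows (illustrative only; it reuses the nested-tuple tree representation from the construction example above, and reads a string of '0'/'1' characters rather than packed bytes as a real decoder would):

<syntaxhighlight lang="python">
def huffman_decode(bits, root):
    """Decode an iterable of '0'/'1' characters with a Huffman tree.

    A leaf is a 1-tuple (symbol,); an internal node is a pair (left, right).
    """
    out = []
    node = root
    for bit in bits:
        # Follow the left child on '0', the right child on '1'.
        node = node[0] if bit == "0" else node[1]
        if len(node) == 1:       # reaching a leaf terminates the search ...
            out.append(node[0])
            node = root          # ... and decoding restarts at the root
    return out

# The tree for the code table above (a1 = 0, a2 = 10, a3 = 110, a4 = 111):
tree = (("a1",), (("a2",), (("a3",), ("a4",))))
print(huffman_decode("010110111", tree))   # -> ['a1', 'a2', 'a3', 'a4']
</syntaxhighlight>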
In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using [[canonical Huffman code|canonical encoding]], the compression model can be precisely reconstructed with just <math>B\cdot 2^B</math> bits of information (where {{mvar|B}} is the number of bits per symbol).

Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree-building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will have been faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well.

In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code-length optimality, however).
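The tree-prepending method described above (0 for a parent node, 1 for a leaf followed by 8 bits of character value) can be sketched as follows; this is illustrative only, again using the nested-tuple tree representation from the earlier sketches and keeping the bit stream as a list of '0'/'1' characters rather than packed bytes:

<syntaxhighlight lang="python">
def serialize(node, bits):
    """Append the tree to a list of '0'/'1' characters: '0' marks a parent
    node, '1' marks a leaf and is followed by 8 bits of character value."""
    if len(node) == 1:                            # leaf node
        bits.append("1")
        bits.extend(format(ord(node[0]), "08b"))
    else:                                         # parent (internal) node
        bits.append("0")
        serialize(node[0], bits)
        serialize(node[1], bits)

def deserialize(bit_iter):
    """Rebuild the tree by reading the same bit stream back."""
    if next(bit_iter) == "1":                     # leaf: the next 8 bits are the character
        byte = "".join(next(bit_iter) for _ in range(8))
        return (chr(int(byte, 2)),)
    left = deserialize(bit_iter)                  # parent: read both subtrees recursively
    right = deserialize(bit_iter)
    return (left, right)

# Round trip for a small tree over the characters 'a', 'b' and 'c':
tree = (("a",), (("b",), ("c",)))
bits = []
serialize(tree, bits)
assert deserialize(iter(bits)) == tree            # the tree is faithfully reconstructed
</syntaxhighlight>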