==Fano's code: binary splitting==

===Outline of Fano's code===

In Fano's method, the symbols are arranged in order from most probable to least probable, and then divided into two sets whose total probabilities are as close as possible to being equal.  All symbols then have the first digits of their codes assigned; symbols in the first set receive "0" and symbols in the second set receive "1".  As long as any sets with more than one member remain, the same process is repeated on those sets, to determine successive digits of their codes.  When a set has been reduced to one symbol, that symbol's code is complete and will not form the prefix of any other symbol's code.

The algorithm produces fairly efficient variable-length encodings; when the two smaller sets produced by a partitioning are in fact of equal probability, the one bit of information used to distinguish them is used most efficiently.  Unfortunately, Shannon–Fano coding does not always produce optimal prefix codes; the set of probabilities {0.35, 0.17, 0.17, 0.16, 0.15} is an example of one that will be assigned non-optimal codes by Shannon–Fano coding.
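
As a worked check of this example, assuming the splitting rule outlined above: Fano's method first divides the sorted probabilities into {0.35, 0.17} and {0.17, 0.16, 0.15}, and ends with codeword lengths 2, 2, 2, 3, 3. A Huffman code for the same probabilities has lengths 1, 3, 3, 3, 3. The expected lengths are then

:<math display="block">L_\text{Fano} = 2\,(0.35+0.17+0.17) + 3\,(0.16+0.15) = 2.31, \qquad L_\text{Huffman} = 1\,(0.35) + 3\,(0.17+0.17+0.16+0.15) = 2.30,</math>

so the code produced by Fano's method is slightly longer on average than the best prefix code for this distribution.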

Fano's version of Shannon–Fano coding is used in the <code>IMPLODE</code> compression method, which is part of the [[Zip (file format)|<code>ZIP</code> file format]].<ref name="appnote">{{cite web
| url        = http://www.pkware.com/documents/casestudies/APPNOTE.TXT
| title      = <code>APPNOTE.TXT</code> - .ZIP File Format Specification
| access-date = 2008-01-06
| publisher  = PKWARE Inc
| date       = 2007-09-28
| quote      = The Imploding algorithm is actually a combination of two distinct algorithms.  The first algorithm compresses repeated byte sequences using a sliding dictionary.  The second algorithm is used to compress the encoding of the sliding dictionary output, using multiple Shannon–Fano trees.
}}</ref>

===The Shannon–Fano tree===

A Shannon–Fano tree is built from the top down by repeatedly splitting the list of symbols, and the resulting tree defines the code table. The algorithm is simple:

# For a given list of symbols, develop a corresponding list of [[probabilities]] or frequency counts so that each symbol’s relative frequency of occurrence is known.
# Sort the list of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.
# Divide the list into two parts, with the total frequency counts of the left part being as close to the total of the right as possible.
# The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.
# Recursively apply steps 3 and 4 to each of the two parts, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
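
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name <code>fano_codes</code> is hypothetical, and it assumes that when two cut positions are equally balanced the leftmost one is chosen, a tie-breaking rule the steps above do not specify.

```python
def fano_codes(freqs):
    """Return {symbol: codeword} built by Fano's binary splitting.

    `freqs` maps each symbol to a count or probability (step 1).
    """
    # Step 2: sort from most to least frequent (ties keep input order).
    items = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) < 2:
            return  # a single symbol is a leaf; its code is complete
        total = sum(w for _, w in group)
        # Step 3: find the cut minimising |left total - right total|.
        # Assumption: on a tie, the leftmost cut wins.
        best_cut, best_diff, running = 1, None, 0
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        left, right = group[:best_cut], group[best_cut:]
        # Step 4: the left part gets 0, the right part gets 1.
        for sym, _ in left:
            codes[sym] += "0"
        for sym, _ in right:
            codes[sym] += "1"
        # Step 5: recurse on both parts.
        split(left)
        split(right)

    split(items)
    return codes


example = fano_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
print(example)
```

Note that with a single input symbol this sketch returns an empty codeword, since no split is ever performed.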

===Example===

[[Image:ShannonCodeAlg.svg|right|thumb|300px|Shannon–Fano Algorithm]]

We continue with the previous example.

:{| class="wikitable" style="text-align: center;"
! Symbol
! A
! B
! C
! D
! E
|-
! Count
| 15
| 7
| 6
| 6
| 5
|-
! Probabilities
| 0.385
| 0.179
| 0.154
| 0.154
| 0.128
|}

All symbols are sorted by frequency, from left to right (shown in Figure a). Putting the dividing line between symbols B and C results in a total of 22 in the left group and a total of 17 in the right group. This minimizes the difference in totals between the two groups.
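
To see why the line falls between B and C, the four possible cut positions can be enumerated. The following is a small sketch; the helper name <code>cut_differences</code> is hypothetical and the counts are those from the table above.

```python
counts = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]


def cut_differences(items):
    """For each cut position, return
    (last symbol on the left, left total, right total, |difference|)."""
    total = sum(c for _, c in items)
    rows, left = [], 0
    for sym, c in items[:-1]:
        left += c
        rows.append((sym, left, total - left, abs(total - 2 * left)))
    return rows


for row in cut_differences(counts):
    print(row)
# ('A', 15, 24, 9)
# ('B', 22, 17, 5)
# ('C', 28, 11, 17)
# ('D', 34, 5, 29)
```

The smallest difference, 5, occurs when the cut is placed after B.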

With this division, A and B will each have a code that starts with a 0 bit, and the C, D, and E codes will all start with a 1, as shown in Figure b. Subsequently, the left half of the tree gets a new division between A and B, which puts A on a leaf with code 00 and B on a leaf with code 01.

After four division procedures, a tree of codes results. In the final tree, the three symbols with the highest frequencies have all been assigned 2-bit codes, and the two symbols with lower counts have 3-bit codes, as shown in the table below:

:{| class="wikitable" style="text-align: center;"
! Symbol
! A
! B
! C
! D
! E
|-
! Probabilities
| 0.385
| 0.179
| 0.154
| 0.154
| 0.128
|-
! First division
| colspan="2" | 0
| colspan="3" | 1
|-
! Second division
| rowspan="2" style="vertical-align:top;" | 0
| rowspan="2" style="vertical-align:top;" | 1
| rowspan="2" style="vertical-align:top;" | 0
| colspan="2" | 1
|-
! Third division
| 0
| 1
|-
! Codewords
| 00
| 01
| 10
| 110
| 111
|}
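
Because no codeword in the table is a prefix of another, a bit string can be decoded from left to right without separators. The sketch below illustrates this with the codewords above; the helper names <code>is_prefix_free</code> and <code>decode</code> are hypothetical.

```python
codewords = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}


def is_prefix_free(codes):
    """True if no codeword is a prefix of (or equal to) another."""
    words = list(codes.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def decode(bits, codes):
    """Decode a string of '0'/'1' characters symbol by symbol."""
    inverse = {word: sym for sym, word in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


print(is_prefix_free(codewords))        # True
print(decode("0001110111", codewords))  # ABDE
```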

This results in lengths of 2 bits for A, B and C and 3 bits for D and E, giving an average length of

:<math display="block">\frac{2\,\text{bits}\cdot(15+7+6) + 3\,\text{bits} \cdot (6+5)}{39\, \text{symbols}} \approx 2.28\,\text{bits per symbol.}</math>

We see that Fano's method, with an average length of 2.28, has outperformed Shannon's method, with an average length of 2.62.
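
The average length of Fano's code can be recomputed exactly from the counts and codewords of the example. This is a small sketch; the function name <code>average_length</code> is hypothetical.

```python
from fractions import Fraction

counts = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codewords = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}


def average_length(counts, codewords):
    """Average codeword length in bits per symbol, as an exact fraction."""
    total = sum(counts.values())
    weighted = sum(counts[s] * len(codewords[s]) for s in counts)
    return Fraction(weighted, total)


avg = average_length(counts, codewords)
print(avg)                   # 89/39
print(round(float(avg), 2))  # 2.28
```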

=== Expected word length ===

Krajči et al.<ref name="Kraj" /> show that the expected codeword length of Fano's method is bounded above by <math>\mathbb{E}L \leq H(X) + 1 - p_\text{min}</math>, where <math>p_\text{min} = \textstyle\min_i p_i</math> is the probability of the least common symbol.
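
As a numerical illustration (not a proof), the bound can be checked for the five-symbol example used in this section. The sketch assumes the counts 15, 7, 6, 6, 5 and the codeword lengths 2, 2, 2, 3, 3 obtained above.

```python
import math

counts = [15, 7, 6, 6, 5]   # A, B, C, D, E
lengths = [2, 2, 2, 3, 3]   # codeword lengths from Fano's method

total = sum(counts)
probs = [c / total for c in counts]

# Entropy H(X) in bits.
entropy = -sum(p * math.log2(p) for p in probs)
# Expected codeword length E[L] in bits per symbol.
expected_length = sum(p * n for p, n in zip(probs, lengths))
# Upper bound H(X) + 1 - p_min.
bound = entropy + 1 - min(probs)

print(f"H(X) = {entropy:.3f}")
print(f"E[L] = {expected_length:.3f}")
print(f"bound = {bound:.3f}")
print(entropy <= expected_length <= bound)  # True
```

For this distribution the entropy is roughly 2.19 bits and the bound roughly 3.06 bits, so the expected length of about 2.28 bits lies well inside it.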