== Main properties ==

The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. In the latter case a [[frequency table]] must be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.

=== Optimality ===
{{See also|Arithmetic coding#Huffman coding}}

Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream. However, it is not optimal when the symbol-by-symbol restriction is dropped, or when the [[probability mass function]]s are unknown. Also, if symbols are not [[independent and identically distributed]], a single code may be insufficient for optimality. Other methods such as [[arithmetic coding]] often have better compression capability. Although both aforementioned methods can combine an arbitrary number of symbols for more efficient coding and generally adapt to the actual input statistics, arithmetic coding does so without significantly increasing its computational or algorithmic complexities (though the simplest version is slower and more complex than Huffman coding). Such flexibility is especially useful when input probabilities are not precisely known or vary significantly within the stream. However, Huffman coding is usually faster, and arithmetic coding was historically a subject of some concern over [[patent]] issues. Thus many technologies have historically avoided arithmetic coding in favor of Huffman and other prefix coding techniques. As of mid-2010, the most commonly used techniques for this alternative to Huffman coding have passed into the public domain as the early patents have expired.

For a set of symbols with a uniform probability distribution and a number of members which is a [[power of two]], Huffman coding is equivalent to simple binary [[Block code|block encoding]], e.g., [[ASCII]] coding. This reflects the fact that compression is not possible with such an input, no matter what the compression method, i.e., doing nothing to the data is the optimal thing to do.

Huffman coding is optimal among all methods in any case where each input symbol is a known independent and identically distributed random variable having a probability that is [[Dyadic distribution|dyadic]]. For example, the probabilities 1/2, 1/4 and 1/4 yield code lengths of 1, 2 and 2 bits, exactly −log<sub>2</sub> of each probability, so the expected code length equals the entropy. Prefix codes, and thus Huffman coding in particular, tend to be inefficient on small alphabets, where probabilities often fall between these optimal (dyadic) points. The worst case for Huffman coding can happen when the probability of the most likely symbol far exceeds 2<sup>−1</sup> = 0.5, making the upper limit of inefficiency unbounded.

There are two related approaches for getting around this particular inefficiency while still using Huffman coding. Combining a fixed number of symbols together ("blocking") often increases (and never decreases) compression. As the size of the block approaches infinity, Huffman coding theoretically approaches the entropy limit, i.e., optimal compression.<ref>{{cite arXiv |last=Gribov |first=Alexander |date=2017-04-10 |title=Optimal Compression of a Polyline with Segments and Arcs |class=cs.CG |eprint=1604.07476}}</ref> However, blocking arbitrarily large groups of symbols is impractical, as the complexity of a Huffman code is linear in the number of possibilities to be encoded, a number that is exponential in the size of a block. This limits the amount of blocking that is done in practice.
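The effect of blocking, and the cost of the exponentially growing code table, can be seen in a small simulation. The following Python sketch is purely illustrative (the two-symbol source and its probabilities are invented for the example): it builds a Huffman code over blocks of ''k'' i.i.d. symbols and reports the expected code length per source symbol.

<syntaxhighlight lang="python">
import heapq
from itertools import count, product
from math import log2, prod


def huffman_lengths(probs):
    """Return {symbol: code length in bits} for a {symbol: probability} map."""
    ticket = count()  # tiebreaker so the heap never compares the dict payloads
    heap = [(p, next(ticket), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)  # pop the two least probable subtrees
        p2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, next(ticket), merged))
    return heap[0][2]


def bits_per_symbol(probs, block):
    """Expected Huffman bits per source symbol when `block` consecutive
    i.i.d. symbols are coded together as one super-symbol."""
    blocked = {syms: prod(probs[s] for s in syms)
               for syms in product(probs, repeat=block)}
    lengths = huffman_lengths(blocked)
    return sum(blocked[s] * lengths[s] for s in blocked) / block


source = {"a": 0.9, "b": 0.1}  # skewed source: p("a") far exceeds 0.5
entropy = -sum(p * log2(p) for p in source.values())
print(f"entropy: {entropy:.3f} bits/symbol")  # about 0.469
for k in (1, 2, 4, 8):
    print(f"block size {k}: {bits_per_symbol(source, k):.3f} bits/symbol, "
          f"{len(source) ** k} table entries")
</syntaxhighlight>

Since a Huffman code wastes less than one bit per coded block, the redundancy per source symbol falls below 1/''k''; the price is a code table with 2<sup>''k''</sup> entries for a binary source.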
A practical alternative, in widespread use, is [[run-length encoding]]. This technique adds one step in advance of entropy coding: the runs of repeated symbols are counted, and the run lengths are then encoded. For the simple case of [[Bernoulli process]]es, [[Golomb coding]] is optimal among prefix codes for coding run length, a fact proved via the techniques of Huffman coding.<ref>{{Cite journal |last1=Gallager |first1=R.G. |last2=van Voorhis |first2=D.C. |title=Optimal source codes for geometrically distributed integer alphabets |journal=[[IEEE Transactions on Information Theory]] |volume=21 |issue=2 |pages=228–230 |year=1975 |doi=10.1109/TIT.1975.1055357}}</ref> A similar approach is taken by fax machines using [[modified Huffman coding]]. However, run-length coding does not adapt to as many input types as other compression technologies.
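A companion Python sketch (again illustrative; the bitstream and the value ''p'' = 0.9 are invented for the example) counts runs of the common symbol and Golomb-codes the run lengths, choosing the parameter ''m'' as the smallest integer with ''p''<sup>''m''</sup> + ''p''<sup>''m''+1</sup> ≤ 1, the optimality condition from the Gallager–van Voorhis paper cited above:

<syntaxhighlight lang="python">
from math import ceil, log


def golomb_m(p):
    """Smallest m with p**m + p**(m+1) <= 1: the optimal Golomb parameter
    for geometrically distributed run lengths (Gallager & van Voorhis, 1975)."""
    return ceil(log(1 + p) / -log(p))


def golomb_encode(n, m):
    """Golomb codeword for a run length n >= 0: the quotient n // m in unary,
    then the remainder n % m in truncated binary."""
    q, r = divmod(n, m)
    code = "1" * q + "0"          # unary quotient, terminated by "0"
    if m == 1:                    # m = 1 degenerates to pure unary coding
        return code
    b = (m - 1).bit_length()      # ceil(log2(m)) bits at most
    cutoff = (1 << b) - m         # this many remainders get the short form
    if r < cutoff:
        return code + format(r, f"0{b - 1}b")
    return code + format(r + cutoff, f"0{b}b")


def run_lengths(bits, common="0"):
    """Lengths of the runs of the common symbol between rare symbols.
    (A real codec also needs a convention for the final, unterminated run.)"""
    runs, n = [], 0
    for ch in bits:
        if ch == common:
            n += 1
        else:
            runs.append(n)
            n = 0
    return runs


data = "0" * 7 + "1" + "0" * 12 + "1" + "0" * 6 + "1"  # mostly-zero stream
m = golomb_m(0.9)                                      # p("0") = 0.9 gives m = 7
coded = "".join(golomb_encode(n, m) for n in run_lengths(data))
print(run_lengths(data), "->", coded,
      f"({len(coded)} bits for {len(data)} input bits)")
</syntaxhighlight>

For ''p'' = 0.9 the rule gives ''m'' = 7, and the 28-bit example stream compresses to 13 bits; the run lengths here are geometrically distributed, which is exactly the case covered by the Gallager–van Voorhis proof.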