==Aspects==

===Relationship to thermodynamic entropy===
{{Main|Entropy in thermodynamics and information theory}}

The inspiration for adopting the word ''entropy'' in information theory came from the close resemblance between Shannon's formula and known formulae from [[statistical mechanics]].

In [[statistical thermodynamics]] the most general formula for the thermodynamic [[entropy]] {{math|''S''}} of a [[thermodynamic system]] is the [[Gibbs entropy]]
<math display="block">S = - k_\text{B} \sum_i p_i \ln p_i \,,</math>
where {{math|''k''<sub>B</sub>}} is the [[Boltzmann constant]] and {{math|''p''<sub>''i''</sub>}} is the probability of a [[Microstate (statistical mechanics)|microstate]]. The [[Entropy (statistical thermodynamics)|Gibbs entropy]] was defined by [[J. Willard Gibbs]] in 1878 after earlier work by [[Ludwig Boltzmann]] (1872).<ref>Compare: Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie : 2 Volumes – Leipzig 1895/98 UB: O 5262-6. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover {{isbn|0-486-68455-5}}</ref>

The Gibbs entropy translates over almost unchanged into the world of [[quantum physics]] to give the [[von Neumann entropy]] introduced by [[John von Neumann]] in 1927:
<math display="block">S = - k_\text{B} \,{\rm Tr}(\rho \ln \rho) \,,</math>
where {{math|''ρ''}} is the [[density matrix]] of the quantum mechanical system and Tr is the [[Trace (linear algebra)|trace]].<ref>{{Cite book|last=Życzkowski|first=Karol|title=Geometry of Quantum States: An Introduction to Quantum Entanglement |title-link=Geometry of Quantum States |publisher=Cambridge University Press|year=2006|pages=301}}</ref>

At an everyday practical level, the links between information entropy and thermodynamic entropy are not evident. Physicists and chemists are apt to be more interested in ''changes'' in entropy as a system spontaneously evolves away from its initial conditions, in accordance with the [[second law of thermodynamics]], than in an unchanging probability distribution. As the minuteness of the Boltzmann constant {{math|''k''<sub>B</sub>}} indicates, the changes in {{math|''S'' / ''k''<sub>B</sub>}} for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in [[data compression]] or [[signal processing]]. In classical thermodynamics, entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which is central to the definition of information entropy.
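For the same distribution {{math|{''p''<sub>''i''</sub>}}}, the two formulas differ only by a choice of units: if {{math|Η}} denotes the Shannon entropy computed with natural logarithms (nats), then {{math|''S'' {{=}} ''k''<sub>B</sub>Η}}, while for {{math|Η}} measured in bits
<math display="block">S = - k_\text{B} \sum_i p_i \ln p_i = (k_\text{B} \ln 2)\, \Eta \,.</math>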
The connection between thermodynamics and what is now known as information theory was first made by Boltzmann and expressed by his [[Boltzmann's entropy formula|equation]]:
<math display="block">S=k_\text{B} \ln W,</math>
where <math>S</math> is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), {{math|''W''}} is the number of microstates (various combinations of particles in various energy states) that can yield the given macrostate, and {{math|''k''<sub>B</sub>}} is the Boltzmann constant.<ref>{{Cite journal|last1=Sharp|first1=Kim|last2=Matschinsky|first2=Franz|date=2015|title=Translation of Ludwig Boltzmann's Paper "On the Relationship between the Second Fundamental Theorem of the Mechanical Theory of Heat and Probability Calculations Regarding the Conditions for Thermal Equilibrium"|journal=Entropy|volume=17|pages=1971–2009|doi=10.3390/e17041971|doi-access=free}}</ref> It is assumed that each microstate is equally likely, so that the probability of a given microstate is {{math|1=''p''<sub>''i''</sub> = 1/''W''}}. When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently {{math|''k''<sub>B</sub>}} times the Shannon entropy), Boltzmann's equation results; the substitution is written out below. In information theoretic terms, the information entropy of a system is the amount of "missing" information needed to determine a microstate, given the macrostate.

In the view of [[Edwin Thompson Jaynes|Jaynes]] (1957),<ref>{{Cite journal|last=Jaynes|first=E. T.|date=1957-05-15|title=Information Theory and Statistical Mechanics|url=https://link.aps.org/doi/10.1103/PhysRev.106.620|journal=Physical Review|volume=106|issue=4|pages=620–630|doi=10.1103/PhysRev.106.620|bibcode=1957PhRv..106..620J|s2cid=17870175 }}</ref> thermodynamic entropy, as explained by [[statistical mechanics]], should be seen as an ''application'' of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with the constant of proportionality being just the Boltzmann constant. Adding heat to a system increases its thermodynamic entropy because it increases the number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer. (See article: ''[[maximum entropy thermodynamics]]''.)
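Explicitly, with each of the {{math|''W''}} microstates equally likely, the substitution gives
<math display="block">S = - k_\text{B} \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W} = k_\text{B} \ln W \,,</math>
which is Boltzmann's formula.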
[[Maxwell's demon]] can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as [[Rolf Landauer|Landauer]] (from 1961) and co-workers<ref>{{Cite journal|last=Landauer|first=R.|date=July 1961|title=Irreversibility and Heat Generation in the Computing Process|url=https://ieeexplore.ieee.org/document/5392446|journal=IBM Journal of Research and Development|volume=5|issue=3|pages=183–191|doi=10.1147/rd.53.0183|issn=0018-8646|access-date=15 December 2021|archive-date=15 December 2021|archive-url=https://web.archive.org/web/20211215235046/https://ieeexplore.ieee.org/document/5392446|url-status=live}}</ref> have shown, to function the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox). [[Landauer's principle]] imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.

===Data compression===
{{Main|Shannon's source coding theorem|Data compression}}

Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. Shannon's entropy measures the information contained in a message, as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets, etc. The minimum channel capacity can be realized in theory by using the [[typical set]] or in practice using [[Huffman coding|Huffman]], [[LZW|Lempel–Ziv]] or [[arithmetic coding]]. (See also [[Kolmogorov complexity]].) In practice, compression algorithms deliberately include some judicious redundancy in the form of [[checksum]]s to protect against errors. The [[entropy rate]] of a data source is the average number of bits per symbol needed to encode it. Shannon's experiments with human predictors show an information rate between 0.6 and 1.3 bits per character in English;<ref>{{cite web |url=http://marknelson.us/2006/08/24/the-hutter-prize/ |title=The Hutter Prize |access-date=2008-11-27 |date=24 August 2006 |author=Mark Nelson |archive-date=1 March 2018 |archive-url=https://web.archive.org/web/20180301161215/http://marknelson.us/2006/08/24/the-hutter-prize/ |url-status=dead }}</ref> the [[PPM compression algorithm]] can achieve a compression rate of 1.5 bits per character in English text.

If a [[Data compression|compression]] scheme is lossless – one in which the entire original message can always be recovered by decompression – then a compressed message has the same quantity of information as the original but is communicated in fewer characters. It has more information (higher entropy) per character. A compressed message has less [[redundancy (information theory)|redundancy]]. [[Shannon's source coding theorem]] states that a lossless compression scheme cannot compress messages, on average, to have ''more'' than one bit of information per bit of message, but that any value ''less'' than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains.
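For illustration, the following Python sketch (the sample string, repetition count, and compression level are arbitrary choices) compares the order-0 empirical entropy of a byte string with the output size of zlib's DEFLATE compressor, which combines Lempel–Ziv parsing with Huffman coding. Because the sample is highly repetitive, the compressor exploits inter-symbol structure that a single-symbol model does not capture.

<syntaxhighlight lang="python">
import math
import zlib
from collections import Counter

# Arbitrary, highly repetitive English sample; repetition keeps zlib's fixed
# header overhead from dominating and gives the LZ stage structure to exploit.
sample = ("the quick brown fox jumps over the lazy dog. " * 200).encode("ascii")

# Order-0 entropy: bits per byte under a model that treats bytes as i.i.d.
counts = Counter(sample)
n = len(sample)
h0 = -sum(c / n * math.log2(c / n) for c in counts.values())

# DEFLATE also exploits repeated substrings, so it can beat the order-0
# figure on text with strong inter-symbol structure.
bits_per_byte = 8 * len(zlib.compress(sample, 9)) / n

print(f"order-0 entropy: {h0:.2f} bits/byte")
print(f"zlib/DEFLATE:    {bits_per_byte:.2f} bits/byte")
</syntaxhighlight>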
Shannon's theorem also implies that no lossless compression scheme can shorten ''all'' messages. If some messages come out shorter, at least one must come out longer due to the [[pigeonhole principle]]. In practical use, this is generally not a problem, because one is usually only interested in compressing certain types of messages, such as a document in English, as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger.

A 2011 study in ''[[Science (journal)|Science]]'' estimates the world's technological capacity to store and communicate optimally compressed information, normalized to the most effective compression algorithms available in the year 2007, thereby estimating the entropy of the technologically available sources.<ref name="HilbertLopez2011">[http://www.sciencemag.org/content/332/6025/60 "The World's Technological Capacity to Store, Communicate, and Compute Information"] {{Webarchive|url=https://web.archive.org/web/20130727161911/http://www.sciencemag.org/content/332/6025/60 |date=27 July 2013 }}, Martin Hilbert and Priscila López (2011), ''[[Science (journal)|Science]]'', 332(6025); free access to the article through here: martinhilbert.net/WorldInfoCapacity.html</ref>{{rp|pp=60–65}}

{| class="wikitable"
|+ All figures in entropically compressed [[exabytes]]
|-
! Type of Information !! 1986 !! 2007
|-
| Storage || 2.6 || 295
|-
| Broadcast || 432 || 1900
|-
| Telecommunications || 0.281 || 65
|}

The authors estimate humankind's technological capacity to store information (fully entropically compressed) in 1986 and again in 2007. They break the information into three categories: information stored on a medium, information received through one-way [[broadcast]] networks, and information exchanged through two-way [[telecommunications network]]s.<ref name="HilbertLopez2011"/>

===Entropy as a measure of diversity===
{{Main|Diversity index}}

Entropy is one of several ways to measure biodiversity and is applied in the form of the [[Diversity index|Shannon index]].<ref>{{Cite journal|last1=Spellerberg|first1=Ian F.|last2=Fedor|first2=Peter J.|date=2003|title=A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the 'Shannon–Wiener' Index|journal=Global Ecology and Biogeography|language=en|volume=12|issue=3|pages=177–179|doi=10.1046/j.1466-822X.2003.00015.x|bibcode=2003GloEB..12..177S |s2cid=85935463 |issn=1466-8238|doi-access=free}}</ref> A diversity index is a quantitative statistical measure of how many different types exist in a dataset, such as species in a community, accounting for ecological [[Species richness|richness]], [[Species evenness|evenness]], and [[Dominance (ecology)|dominance]]. Specifically, Shannon entropy is the logarithm of {{math|<sup>1</sup>D}}, the [[true diversity]] index with parameter equal to 1. The Shannon index is related to the proportional abundances of types.
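A minimal sketch of how the Shannon index and the corresponding true diversity are computed (the abundance counts below are hypothetical):

<syntaxhighlight lang="python">
import math

# Hypothetical abundance counts for four species in one community sample.
abundances = [50, 30, 15, 5]
total = sum(abundances)
proportions = [a / total for a in abundances]

# Shannon index H' (natural logarithm, as is conventional in ecology).
shannon_index = -sum(p * math.log(p) for p in proportions)

# True diversity of order 1: the "effective number of species", exp(H').
true_diversity = math.exp(shannon_index)

print(f"Shannon index H' = {shannon_index:.3f} nats")
print(f"true diversity ^1D = {true_diversity:.2f} effective species")
</syntaxhighlight>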
===Entropy of a sequence===
There are a number of entropy-related concepts that mathematically quantify the information content of a sequence or message:

* the '''[[self-information]]''' of an individual message or symbol taken from a given probability distribution (message or sequence seen as an individual event),
* the '''[[joint entropy]]''' of the symbols forming the message or sequence (seen as a set of events),
* the '''[[entropy rate]]''' of a [[stochastic process]] (message or sequence seen as a succession of events).

(The "rate of self-information" can also be defined for a particular sequence of messages or symbols generated by a given stochastic process: this will always be equal to the entropy rate in the case of a [[stationary process]].) Other [[quantities of information]] are also used to compare or relate different sources of information.

It is important not to confuse the above concepts. Often it is only clear from context which one is meant. For example, when someone says that the "entropy" of the English language is about 1 bit per character, they are actually modeling the English language as a stochastic process and talking about its entropy ''rate''. Shannon himself used the term in this way. If very large blocks are used, the estimate of the per-character entropy rate may become artificially low because the probability distribution of the sequence is not known exactly; it is only an estimate.

If one considers the text of every book ever published as a sequence, with each symbol being the text of a complete book, and if there are {{math|''N''}} published books, and each book is only published once, the estimate of the probability of each book is {{math|1/''N''}}, and the entropy (in bits) is {{math|−log<sub>2</sub>(1/''N'') {{=}} log<sub>2</sub>(''N'')}}. As a practical code, this corresponds to assigning each book a [[ISBN|unique identifier]] and using it in place of the text of the book whenever one wants to refer to the book. This is enormously useful for talking about books, but it is not so useful for characterizing the information content of an individual book, or of language in general: it is not possible to reconstruct the book from its identifier without knowing the probability distribution, that is, the complete text of all the books. The key idea is that the complexity of the probabilistic model must be considered. [[Kolmogorov complexity]] is a theoretical generalization of this idea that allows the consideration of the information content of a sequence independent of any particular probability model; it considers the shortest [[Computer program|program]] for a [[universal computer]] that outputs the sequence. A code that achieves the entropy rate of a sequence for a given model, plus the codebook (i.e. the probabilistic model), is one such program, but it may not be the shortest.

Consider the Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, .... Treating the sequence as a message and each number as a symbol, there are almost as many symbols as there are characters in the message, giving an entropy of approximately {{math|log<sub>2</sub>(''n'')}}. The first 128 symbols of the Fibonacci sequence have an entropy of approximately 7 bits/symbol, but the sequence can be expressed using a formula [{{math|F(''n'') {{=}} F(''n''−1) + F(''n''−2)}} for {{math|''n'' {{=}} 3, 4, 5, ...}}, {{math|F(1) {{=}} 1}}, {{math|F(2) {{=}} 1}}] and this formula has a much lower entropy and applies to any length of the Fibonacci sequence.
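A short Python sketch can check the figure quoted above by computing the empirical per-symbol entropy of the first 128 Fibonacci numbers:

<syntaxhighlight lang="python">
import math
from collections import Counter

# First 128 Fibonacci numbers, treated as a message of 128 symbols.
fib = [1, 1]
while len(fib) < 128:
    fib.append(fib[-1] + fib[-2])

# Empirical per-symbol entropy: every value is distinct except the repeated 1,
# so the distribution is nearly uniform over 127 symbol types and the entropy
# comes out just under log2(128) = 7 bits per symbol.
counts = Counter(fib)
n = len(fib)
h = -sum(c / n * math.log2(c / n) for c in counts.values())
print(f"{h:.3f} bits/symbol")
</syntaxhighlight>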
===Limitations of entropy in cryptography===
In [[cryptanalysis]], entropy is often roughly used as a measure of the unpredictability of a cryptographic key, though its real uncertainty is unmeasurable. For example, a 128-bit key that is uniformly and randomly generated has 128 bits of entropy. It also takes (on average) <math>2^{127}</math> guesses to break by brute force. Entropy fails to capture the number of guesses required if the possible keys are not chosen uniformly.<ref>{{cite conference |first1=James |last1=Massey |year=1994 |title=Guessing and Entropy |book-title=Proc. IEEE International Symposium on Information Theory |url=http://www.isiweb.ee.ethz.ch/archive/massey_pub/pdf/BI633.pdf |access-date=31 December 2013 |archive-date=1 January 2014 |archive-url=https://web.archive.org/web/20140101065020/http://www.isiweb.ee.ethz.ch/archive/massey_pub/pdf/BI633.pdf |url-status=live }}</ref><ref>{{cite conference |first1=David |last1=Malone |first2=Wayne |last2=Sullivan |year=2005 |title=Guesswork is not a Substitute for Entropy |book-title=Proceedings of the Information Technology & Telecommunications Conference |url=http://www.maths.tcd.ie/~dwmalone/p/itt05.pdf |access-date=31 December 2013 |archive-date=15 April 2016 |archive-url=https://web.archive.org/web/20160415054357/http://www.maths.tcd.ie/~dwmalone/p/itt05.pdf |url-status=live }}</ref> Instead, a measure called ''guesswork'' can be used to measure the effort required for a brute force attack.<ref>{{cite conference |first1=John |last1=Pliam |title=Selected Areas in Cryptography |year=1999 |chapter=Guesswork and variation distance as measures of cipher security|series=Lecture Notes in Computer Science |volume=1758 |pages=62–77 |book-title=International Workshop on Selected Areas in Cryptography |doi=10.1007/3-540-46513-8_5 |isbn=978-3-540-67185-5 |doi-access=free }}</ref>

Other problems may arise from non-uniform distributions used in cryptography. Consider, for example, a 1,000,000-digit binary [[one-time pad]] combined with the plaintext using exclusive or. If the pad has 1,000,000 bits of entropy, it is perfect. If the pad has 999,999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy), it may provide good security. But if the pad has 999,999 bits of entropy, where the first bit is fixed and the remaining 999,999 bits are perfectly random, the first bit of the ciphertext will not be encrypted at all.

===Data as a Markov process===
A common way to define entropy for text is based on the [[Markov model]] of text. For an order-0 source (each character is selected independently of the preceding characters), the binary entropy is:
<math display="block">\Eta(\mathcal{S}) = - \sum_i p_i \log p_i ,</math>
where {{math|''p''<sub>''i''</sub>}} is the probability of {{math|''i''}}. For a first-order [[Markov source]] (one in which the probability of selecting a character is dependent only on the immediately preceding character), the '''[[entropy rate]]''' is:{{citation needed|date=April 2013}}
<math display="block">\Eta(\mathcal{S}) = - \sum_i p_i \sum_j p_i(j) \log p_i(j) ,</math>
where {{math|''i''}} is a '''state''' (certain preceding characters) and <math>p_i(j)</math> is the probability of {{math|''j''}} given {{math|''i''}} as the previous character.

For a second-order Markov source, the entropy rate is
<math display="block">\Eta(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \sum_k p_{i,j}(k)\ \log p_{i,j}(k) .</math>
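As a rough illustration, the following Python sketch estimates the order-0 entropy and the first-order entropy rate of a string from empirical character and character-pair frequencies, implementing the first two formulas above with base-2 logarithms (the sample string is arbitrary and far too short to give reliable estimates):

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

def order0_entropy(text: str) -> float:
    """H = -sum_i p_i log2 p_i over single-character frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_entropy_rate(text: str) -> float:
    """H = -sum_i p_i sum_j p_i(j) log2 p_i(j), estimated from adjacent pairs."""
    followers = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        followers[a][b] += 1
    n_pairs = len(text) - 1
    h = 0.0
    for nexts in followers.values():
        state_total = sum(nexts.values())
        p_i = state_total / n_pairs        # p_i: probability of the state
        for count in nexts.values():
            p_ij = count / state_total     # p_i(j): transition probability
            h -= p_i * p_ij * math.log2(p_ij)
    return h

sample = "the theory of communication treats the entropy of the source of messages"
print(f"order-0: {order0_entropy(sample):.2f} bits/char")
print(f"order-1: {order1_entropy_rate(sample):.2f} bits/char")
</syntaxhighlight>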