==Quantities of information==
{{Unreferenced section|date=April 2024}}
{{Main|Quantities of information}}

Information theory is based on [[probability theory]] and statistics, where [[Quantities of information|quantified information]] is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called [[Entropy (information theory)|entropy]], which forms the building block of many other measures. Entropy quantifies the amount of information in a single random variable.<ref>{{Cite web |last=Braverman |first=Mark |date=September 19, 2011 |title=Information Theory in Computer Science |url=https://www.cs.princeton.edu/courses/archive/fall11/cos597D/L01.pdf }}</ref> Another useful concept is mutual information, defined on two random variables, which measures the amount of information the two variables have in common and can be used to characterize their dependence. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy [[Communication channel|channel]] in the limit of long block lengths, when the channel statistics are determined by the joint distribution.

The choice of logarithmic base in the following formulae determines the [[units of measurement|unit]] of information entropy that is used. A common unit of information is the bit or [[shannon (unit)|shannon]], based on the [[binary logarithm]]. Other units include the [[nat (unit)|nat]], which is based on the [[natural logarithm]], and the [[deciban|decimal digit]], which is based on the [[common logarithm]].

In what follows, an expression of the form {{math|''p'' log ''p''}} is considered by convention to be equal to zero whenever {{math|1=''p'' = 0}}. This is justified because <math>\lim_{p \rightarrow 0+} p \log p = 0</math> for any logarithmic base.

===Entropy of an information source===
Based on the [[probability mass function]] of each source symbol to be communicated, the Shannon [[Entropy (information theory)|entropy]] {{math|''H''}}, in units of bits (per symbol), is given by

:<math>H = - \sum_{i} p_i \log_2 (p_i)</math>

where {{math|''p<sub>i</sub>''}} is the probability of occurrence of the {{math|''i''}}-th possible value of the source symbol. This equation gives the entropy in units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the [[Shannon (unit)|shannon]] in his honor. Entropy is also commonly computed using the natural logarithm (base {{mvar|[[E (mathematical constant)|e]]}}, where {{mvar|e}} is Euler's number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base {{nowrap|1=2<sup>8</sup> = 256}} will produce a measurement in [[byte]]s per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or [[Hartley (unit)|hartleys]]) per symbol.
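For illustration, the entropy of a source can be computed directly from its probability mass function. The following Python sketch (the distribution below is an arbitrary example) shows that changing the base of the logarithm only changes the unit in which the same entropy is expressed:

<syntaxhighlight lang="python">
import math

def entropy(pmf, base=2):
    """Shannon entropy of a probability mass function; by convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# Arbitrary example: a source emitting four symbols with these probabilities.
pmf = [0.5, 0.25, 0.125, 0.125]

print(entropy(pmf, base=2))       # 1.75 bits (shannons) per symbol
print(entropy(pmf, base=math.e))  # ~1.213 nats per symbol
print(entropy(pmf, base=10))      # ~0.527 hartleys (decimal digits) per symbol
</syntaxhighlight>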
Intuitively, the entropy {{math|''H<sub>X</sub>''}} of a discrete random variable {{math|''X''}} is a measure of the amount of ''uncertainty'' associated with the value of {{math|''X''}} when only its distribution is known.

The entropy of a source that emits a sequence of {{math|''N''}} symbols that are [[independent and identically distributed]] (iid) is {{math|''N'' ⋅ ''H''}} bits (per message of {{math|''N''}} symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length {{math|''N''}} will be less than {{math|''N'' ⋅ ''H''}}.

[[File:Binary entropy plot.svg|thumbnail|right|200px|The entropy of a [[Bernoulli trial]] as a function of success probability, often called the {{em|[[binary entropy function]]}}, {{math|''H''<sub>b</sub>(''p'')}}. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.]]

If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If <math>\mathbb{X}</math> is the set of all messages {{math|{{mset|''x''<sub>1</sub>, ..., ''x''<sub>''n''</sub>}}}} that {{math|''X''}} could be, and {{math|''p''(''x'')}} is the probability of some <math>x \in \mathbb X</math>, then the entropy, {{math|''H''}}, of {{math|''X''}} is defined:{{sfn|Reza|1994}}

:<math> H(X) = \mathbb{E}_{X} [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).</math>

(Here, {{math|''I''(''x'')}} is the [[self-information]], which is the entropy contribution of an individual message, and <math>\mathbb{E}_X</math> is the [[expected value]].) A property of entropy is that it is maximized when all the messages in the message space are equiprobable, {{math|1=''p''(''x'') = 1/''n''}} (i.e., the source is maximally unpredictable), in which case {{math|1=''H''(''X'') = log ''n''}}.

The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the [[Shannon (unit)|shannon]] (Sh) as unit:

:<math>H_{\mathrm{b}}(p) = - p \log_2 p - (1-p)\log_2 (1-p).</math>

===Joint entropy===
The {{em|[[joint entropy]]}} of two discrete random variables {{math|''X''}} and {{math|''Y''}} is merely the entropy of their pairing: {{math|(''X'', ''Y'')}}. This implies that if {{math|''X''}} and {{math|''Y''}} are [[statistical independence|independent]], then their joint entropy is the sum of their individual entropies. For example, if {{math|(''X'', ''Y'')}} represents the position of a chess piece ({{math|''X''}} the row and {{math|''Y''}} the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.

:<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,</math>

Despite similar notation, joint entropy should not be confused with {{em|[[cross-entropy]]}}.
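For illustration, the following Python sketch (the marginal distributions below are arbitrary) evaluates the binary entropy function and checks that the joint entropy of two independent variables is the sum of their individual entropies:

<syntaxhighlight lang="python">
import math

def H_binary(p):
    """Binary entropy function H_b(p) in shannons (bits), with 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(H_binary(0.5))   # 1.0 bit: maximal, as for an unbiased coin toss
print(H_binary(0.11))  # ~0.50 bits: a heavily biased coin is more predictable

def H_joint(joint):
    """Joint entropy H(X, Y), in bits, from a dict {(x, y): p(x, y)}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Independent X and Y: p(x, y) = p(x) p(y), so H(X, Y) = H(X) + H(Y).
px = {0: 0.5, 1: 0.5}
py = {'a': 0.9, 'b': 0.1}
joint = {(x, y): px[x] * py[y] for x in px for y in py}
print(H_joint(joint))  # ~1.469 bits = 1.0 + ~0.469 bits
</syntaxhighlight>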
===Conditional entropy (equivocation)===
The {{em|[[conditional entropy]]}} or ''conditional uncertainty'' of {{math|''X''}} given random variable {{math|''Y''}} (also called the ''equivocation'' of {{math|''X''}} about {{math|''Y''}}) is the average conditional entropy over {{math|''Y''}}:{{sfn|Ash|1990}}

:<math> H(X|Y) = \mathbb E_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log p(x|y).</math>

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:

:<math> H(X|Y) = H(X,Y) - H(Y) .\,</math>

===Mutual information (transinformation)===
''[[Mutual information]]'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of {{math|''X''}} relative to {{math|''Y''}} is given by:

:<math>I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}</math>

where {{math|SI}} (''specific'' mutual information) is the [[pointwise mutual information]].

A basic property of the mutual information is that

:<math>I(X;Y) = H(X) - H(X|Y).\,</math>

That is, knowing ''Y'', we can save an average of {{math|''I''(''X''; ''Y'')}} bits in encoding ''X'' compared to not knowing ''Y''.

Mutual information is [[symmetric function|symmetric]]:

:<math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\,</math>

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the [[posterior probability|posterior probability distribution]] of ''X'' given the value of ''Y'' and the [[prior probability|prior distribution]] on ''X'':

:<math>I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )].</math>

In other words, this is a measure of how much, on average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

:<math>I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math>

Mutual information is closely related to the [[likelihood-ratio test|log-likelihood ratio test]] in the context of contingency tables and the [[multinomial distribution]] and to [[Pearson's chi-squared test|Pearson's χ<sup>2</sup> test]]: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
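These relations can be checked numerically. The following Python sketch (the joint distribution is an arbitrary illustrative choice, not taken from the literature) computes the entropies and the mutual information from a joint probability table and verifies the identities above:

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def H(probs):
    """Entropy, in bits, of a collection of probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution p(x, y), for illustration only.
joint = {('rain', 'umbrella'): 0.3, ('rain', 'none'): 0.1,
         ('sun', 'umbrella'): 0.1, ('sun', 'none'): 0.5}

# Marginal distributions p(x) and p(y).
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_X_given_Y = H_XY - H_Y              # H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y              # I(X;Y) = H(X) - H(X|Y)

print(H_X, H_Y, H_XY)                 # ~0.971, ~0.971, ~1.685 bits
print(I_XY)                           # ~0.256 bits
print(H_X + H_Y - H_XY)               # same value, reflecting the symmetry of I(X;Y)
</syntaxhighlight>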
===Kullback–Leibler divergence (information gain)===
The ''[[Kullback–Leibler divergence]]'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" [[probability distribution]] {{tmath|p(X)}}, and an arbitrary probability distribution {{tmath|q(X)}}. If we compress data in a manner that assumes {{tmath|q(X)}} is the distribution underlying some data, when, in reality, {{tmath|p(X)}} is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. It is thus defined

:<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \sum_{x \in X} -p(x) \log {p(x)} = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math>

Although it is sometimes used as a 'distance metric', KL divergence is not a true [[Metric (mathematics)|metric]] since it is not symmetric and does not satisfy the [[triangle inequality]] (making it a semi-quasimetric).

Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number ''X'' is about to be drawn randomly from a discrete set with probability distribution {{tmath|p(x)}}. If Alice knows the true distribution {{tmath|p(x)}}, while Bob believes (has a [[prior probability|prior]]) that the distribution is {{tmath|q(x)}}, then Bob will be more [[Information content|surprised]] than Alice, on average, upon seeing the value of ''X''. The KL divergence is the (objective) expected value of Bob's (subjective) [[Information content|surprisal]] minus Alice's surprisal, measured in bits if the ''log'' is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
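As a numerical illustration (the two distributions below are arbitrary examples), the following Python sketch evaluates the definition and shows that the divergence is not symmetric:

<syntaxhighlight lang="python">
import math

def kl_divergence(p, q):
    """D_KL(p || q), in bits, for two pmfs over the same finite alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # the "true" distribution
q = [1/3, 1/3, 1/3]      # the assumed (coding) distribution, e.g. Bob's prior

print(kl_divergence(p, q))  # ~0.085 extra bits per datum when coding with q
print(kl_divergence(q, p))  # ~0.082 bits: the KL divergence is not symmetric
</syntaxhighlight>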
===Directed information===
''[[Directed information]]'', <math>I(X^n\to Y^n)</math>, is an information-theoretic measure that quantifies the [[information]] flow from the random process <math>X^n = \{X_1,X_2,\dots,X_n\}</math> to the random process <math>Y^n = \{Y_1,Y_2,\dots,Y_n\}</math>. The term ''directed information'' was coined by [[James Massey]] and is defined as

:<math>I(X^n\to Y^n) \triangleq \sum_{i=1}^n I(X^i;Y_i|Y^{i-1}),</math>

where <math>I(X^{i};Y_i|Y^{i-1})</math> is the [[conditional mutual information]] <math>I(X_1,X_2,...,X_{i};Y_i|Y_1,Y_2,...,Y_{i-1})</math>. In contrast to ''mutual'' information, ''directed'' information is not symmetric. <math>I(X^n\to Y^n)</math> measures the information that is transmitted causally from <math>X^n</math> to <math>Y^n</math>, that is, the information that the past and present inputs <math>X^i</math> provide about each output <math>Y_i</math> beyond what is already given by the past outputs <math>Y^{i-1}</math>. Directed information has many applications in problems where [[causality]] plays an important role, such as the [[channel capacity|capacity of channels]] with feedback,<ref name=massey>{{citation |last1=Massey|first1=James|contribution=Causality, Feedback And Directed Information|date=1990|title=Proc. 1990 Intl. Symp. on Info. Th. and its Applications|citeseerx=10.1.1.36.5688}}</ref><ref>{{cite journal|last1=Permuter|first1=Haim Henry|last2=Weissman|first2=Tsachy|last3=Goldsmith|first3=Andrea J.|title=Finite State Channels With Time-Invariant Deterministic Feedback|journal=IEEE Transactions on Information Theory|date=February 2009|volume=55|issue=2|pages=644–662|doi=10.1109/TIT.2008.2009849|arxiv=cs/0608070|s2cid=13178}}</ref> capacity of discrete [[memoryless]] networks with feedback,<ref>{{cite journal|last1=Kramer|first1=G.|title=Capacity results for the discrete memoryless network|journal=IEEE Transactions on Information Theory|date=January 2003|volume=49|issue=1|pages=4–21|doi=10.1109/TIT.2002.806135}}</ref> [[sports gambling|gambling]] with causal side information,<ref>{{cite journal|last1=Permuter|first1=Haim H.|last2=Kim|first2=Young-Han|last3=Weissman|first3=Tsachy|title=Interpretations of Directed Information in Portfolio Theory, Data Compression, and Hypothesis Testing|journal=IEEE Transactions on Information Theory|date=June 2011|volume=57|issue=6|pages=3248–3259|doi=10.1109/TIT.2011.2136270|arxiv=0912.4872|s2cid=11722596}}</ref> [[Data compression|compression]] with causal side information,<ref>{{cite journal|last1=Simeone|first1=Osvaldo|last2=Permuter|first2=Haim Henri|title=Source Coding When the Side Information May Be Delayed|journal=IEEE Transactions on Information Theory|date=June 2013|volume=59|issue=6|pages=3607–3618|doi=10.1109/TIT.2013.2248192|arxiv=1109.1293|s2cid=3211485}}</ref> [[real-time control]] communication settings,<ref>{{cite journal|last1=Charalambous|first1=Charalambos D.|last2=Stavrou|first2=Photios A.|title=Directed Information on Abstract Spaces: Properties and Variational Equalities|journal=IEEE Transactions on Information Theory|date=August 2016|volume=62|issue=11|pages=6019–6052|doi=10.1109/TIT.2016.2604846|arxiv=1302.3971|s2cid=8107565}}</ref><ref>{{cite journal |last1=Tanaka |first1=Takashi |last2=Esfahani |first2=Peyman Mohajerin |last3=Mitter |first3=Sanjoy K. |title=LQG Control With Minimum Directed Information: Semidefinite Programming Approach |journal=IEEE Transactions on Automatic Control |date=January 2018 |volume=63 |issue=1 |pages=37–52 |doi=10.1109/TAC.2017.2709618|s2cid=1401958 |url=http://resolver.tudelft.nl/uuid:d9db1c11-fbfd-4c0c-b66f-f341b49fa61a |arxiv=1510.04214 |via=TU Delft Repositories |url-status=live |archive-url=https://web.archive.org/web/20240412014101/https://repository.tudelft.nl/islandora/object/uuid:d9db1c11-fbfd-4c0c-b66f-f341b49fa61a/datastream/OBJ/download |archive-date= Apr 12, 2024 }}</ref> and in statistical physics.<ref>{{cite journal |last1=Vinkler |first1=Dror A |last2=Permuter |first2=Haim H |last3=Merhav |first3=Neri |title=Analogy between gambling and measurement-based work extraction |journal=Journal of Statistical Mechanics: Theory and Experiment |date=20 April 2016 |volume=2016 |issue=4 |pages=043403 |doi=10.1088/1742-5468/2016/04/043403|arxiv=1404.6788 |bibcode=2016JSMTE..04.3403V |s2cid=124719237 }}</ref>

===Other quantities===
Other important information theoretic quantities include the [[Rényi entropy]] and the [[Tsallis entropy]] (generalizations of the concept of entropy), [[differential entropy]] (a generalization of quantities of information to continuous distributions), and the [[conditional mutual information]]. Also, [[pragmatic information]] has been proposed as a measure of how much information has been used in making a decision.
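As one illustrative sketch of such a generalization (the distribution below is arbitrary), the [[Rényi entropy]] of order {{math|''α'' ≠ 1}}, <math>H_\alpha(X) = \tfrac{1}{1-\alpha} \log_2 \sum_i p_i^\alpha</math>, approaches the Shannon entropy as {{math|''α'' → 1}}:

<syntaxhighlight lang="python">
import math

def renyi_entropy(pmf, alpha):
    """Rényi entropy of order alpha (alpha != 1), in bits."""
    return math.log2(sum(p ** alpha for p in pmf)) / (1 - alpha)

pmf = [0.5, 0.25, 0.125, 0.125]   # arbitrary example distribution

print(renyi_entropy(pmf, 0.5))     # ~1.874 bits
print(renyi_entropy(pmf, 2.0))     # ~1.541 bits (the collision entropy)
print(renyi_entropy(pmf, 1.0001))  # ~1.75 bits, close to the Shannon entropy (1.75 bits)
</syntaxhighlight>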