===Mutual information (transinformation)===
''[[Mutual information]]'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of {{math|''X''}} relative to {{math|''Y''}} is given by:

:<math>I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}</math>
where {{math|SI(''x'', ''y'')}} (''specific mutual information'') is the [[pointwise mutual information]], <math>\log \tfrac{p(x,y)}{p(x)\, p(y)}</math>.
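
As an illustration, the sum above can be evaluated directly for a small discrete joint distribution. The following is a minimal sketch rather than a reference implementation: the function name <code>mutual_information</code>, the nested-list layout <code>joint[x][y]</code>, and the example tables are assumptions made for this example, and the logarithm is taken to base 2 so that the result is in bits.

```python
import math

def mutual_information(joint, base=2.0):
    """Return I(X;Y) for a joint pmf given as a nested list joint[x][y]."""
    # Marginals: p(x) sums each row, p(y) sums each column.
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with p(x,y) = 0 contribute nothing
                total += pxy * math.log(pxy / (px[i] * py[j]), base)
    return total

# Hypothetical joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
mi_bits = mutual_information(joint)

# Independent variables: p(x,y) = p(x) p(y), so every log term is zero.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
mi_indep = mutual_information(independent)

print(mi_bits, mi_indep)
```

For the first table the result is about 0.278 bits, and for the independent table it is zero.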

A basic property of the mutual information is that
: <math>I(X;Y) = H(X) - H(X|Y).\,</math>
That is, knowing ''Y'', we can save an average of {{math|''I''(''X''; ''Y'')}} bits in encoding ''X'' compared to not knowing ''Y''.

Mutual information is [[symmetric function|symmetric]]:
: <math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\,</math>
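
The entropy identities above can be checked numerically on a small example. The sketch below assumes a hypothetical 2×2 joint distribution and computes the conditional entropy directly from {{math|''p''(''x''{{!}}''y'')}}, so that the comparison with the definition of mutual information is not circular. All names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical joint pmf, joint[x][y]; rows index X, columns index Y.
joint = [[0.3, 0.2],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy(p for row in joint for p in row)

# Conditional entropy computed directly: H(X|Y) = -sum p(x,y) log p(x|y).
h_x_given_y = -sum(
    pxy * math.log2(pxy / py[j])
    for row in joint
    for j, pxy in enumerate(row)
    if pxy > 0.0
)

# Mutual information straight from its definition.
mi = sum(
    pxy * math.log2(pxy / (px[i] * py[j]))
    for i, row in enumerate(joint)
    for j, pxy in enumerate(row)
    if pxy > 0.0
)

# All three printed values agree up to floating-point rounding.
print(mi, h_x - h_x_given_y, h_x + h_y - h_xy)
```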

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the [[posterior probability|posterior probability distribution]] of ''X'' given the value of ''Y'' and the [[prior probability|prior distribution]] on ''X'':
: <math>I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )].</math>
In other words, this is a measure of how much, on average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often rewritten as the divergence from the product of the marginal distributions to the actual joint distribution:
: <math>I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math>
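
The two Kullback–Leibler forms can be compared on a concrete table. The following sketch assumes a hypothetical joint distribution with strictly positive marginals; the helper <code>kl_divergence</code> and the variable names are illustrative only.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two pmfs given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical joint pmf, joint[x][y]; all marginals are assumed positive.
joint = [[0.3, 0.2],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Form 1: average over y of D_KL( p(X | Y=y) || p(X) ).
expected_kl = 0.0
for j, p_y in enumerate(py):
    posterior = [joint[i][j] / p_y for i in range(len(px))]
    expected_kl += p_y * kl_divergence(posterior, px)

# Form 2: D_KL( p(X,Y) || p(X) p(Y) ), with both tables flattened row by row.
flat_joint = [p for row in joint for p in row]
flat_product = [a * b for a in px for b in py]
joint_kl = kl_divergence(flat_joint, flat_product)

# Both forms give the same mutual information, up to rounding.
print(expected_kl, joint_kl)
```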

Mutual information is closely related to the [[likelihood-ratio test|log-likelihood ratio test]] in the context of contingency tables and the [[multinomial distribution]] and to [[Pearson's chi-squared test|Pearson's χ<sup>2</sup> test]]: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
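
The connection to the log-likelihood ratio test can be made concrete. For a contingency table with {{math|''N''}} observations, the ''G'' statistic equals {{math|2''N''}} times the mutual information of the empirical distribution, measured in nats. Under independence it is asymptotically χ<sup>2</sup>-distributed with {{math|(''r'' − 1)(''c'' − 1)}} degrees of freedom for an {{math|''r'' × ''c''}} table. The sketch below uses made-up counts and illustrative variable names, and only verifies the identity; it does not compute a p-value.

```python
import math

# Hypothetical 2x2 contingency table of observed counts.
counts = [[30, 20],
          [10, 40]]
n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# G statistic: 2 * sum O * ln(O / E), with E = row_total * col_total / n.
g_stat = 2.0 * sum(
    o * math.log(o * n / (row_totals[i] * col_totals[j]))
    for i, row in enumerate(counts)
    for j, o in enumerate(row)
    if o > 0
)

# Empirical mutual information in nats, from the relative frequencies.
mi_nats = sum(
    (o / n) * math.log((o / n) / ((row_totals[i] / n) * (col_totals[j] / n)))
    for i, row in enumerate(counts)
    for j, o in enumerate(row)
    if o > 0
)

# Degrees of freedom of the asymptotic chi-squared distribution.
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# g_stat and 2 * n * mi_nats agree up to floating-point rounding.
print(g_stat, 2 * n * mi_nats, dof)
```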