===Kullback–Leibler divergence (information gain)===
The ''[[Kullback–Leibler divergence]]'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" [[probability distribution]] {{tmath|p(X)}}, and an arbitrary probability distribution {{tmath|q(X)}}. If we compress data in a manner that assumes {{tmath|q(X)}} is the distribution underlying some data, when, in reality, {{tmath|p(X)}} is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. It is thus defined

:<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \sum_{x \in X} -p(x) \log {p(x)} = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math>

Although it is sometimes used as a 'distance metric', KL divergence is not a true [[Metric (mathematics)|metric]] since it is not symmetric and does not satisfy the [[triangle inequality]] (making it a semi-quasimetric).

Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior that differs from the truth: suppose a number ''X'' is about to be drawn randomly from a discrete set with probability distribution {{tmath|p(x)}}. If Alice knows the true distribution {{tmath|p(x)}}, while Bob believes (has a [[prior probability|prior]]) that the distribution is {{tmath|q(x)}}, then Bob will be more [[Information content|surprised]] than Alice, on average, upon seeing the value of ''X''. The KL divergence is the (objective) expected value of Bob's (subjective) [[Information content|surprisal]] minus Alice's surprisal, measured in bits if the ''log'' is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
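A small numerical sketch (not part of the article; the three-outcome distributions below are purely illustrative) shows the two equivalent forms of the definition agreeing, in bits, and why the result reads as "extra bits per datum" when coding with the wrong distribution:

<syntaxhighlight lang="python">
import math

# Illustrative toy distributions over three outcomes (assumed for this sketch).
p = [0.5, 0.25, 0.25]   # "true" distribution p(X)
q = [0.25, 0.25, 0.5]   # assumed distribution q(X)

# Direct form: sum over x of p(x) * log2(p(x)/q(x)), in bits (log base 2).
d_kl = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

# Equivalent form: cross-entropy H(p, q) minus entropy H(p).
cross_entropy = -sum(px * math.log2(qx) for px, qx in zip(p, q))
entropy = -sum(px * math.log2(px) for px in p)

print(d_kl)                     # 0.25 bits
print(cross_entropy - entropy)  # same value: average extra bits per datum when coding with q

# Asymmetry: swapping the roles of p and q gives a different value,
# which is one reason KL divergence is not a true metric.
d_kl_reversed = sum(qx * math.log2(qx / px) for px, qx in zip(p, q))
print(d_kl_reversed)            # differs from d_kl
</syntaxhighlight>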