Kullback–Leibler divergence

Template:Short description In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence<ref name=Csiszar>Template:Cite journal</ref>), denoted <math>D_\text{KL}(P \parallel Q)</math>, is a type of statistical distance: a measure of how much a model probability distribution Template:Mvar is different from a true probability distribution Template:Mvar.<ref name="KullbackLeibler1951">Template:Cite journal</ref>Template:Sfn Mathematically, it is defined as

<math display="block">D_\text{KL}(P \parallel Q) = \sum_{ x \in \mathcal{X} } P(x) \, \log \frac{ P(x) }{ Q(x) }.</math>

A simple interpretation of the KL divergence of Template:Mvar from Template:Mvar is the expected excess surprise from using Template:Mvar as a model instead of Template:Mvar when the actual distribution is Template:Mvar. While it is a measure of how different two distributions are and is thus a distance in some sense, it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence,Template:Sfn a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).Template:Sfn

Relative entropy is always a non-negative real number, with value 0 if and only if the two distributions in question are identical. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience, bioinformatics, and machine learning.

Template:Toclimit

Introduction and contextEdit

Consider two probability distributions Template:Mvar and Template:Mvar. Usually, Template:Mvar represents the data, the observations, or a measured probability distribution. Distribution Template:Mvar represents instead a theory, a model, a description or an approximation of Template:Mvar. The Kullback–Leibler divergence <math>D_\text{KL}(P \parallel Q)</math> is then interpreted as the average difference of the number of bits required for encoding samples of Template:Mvar using a code optimized for Template:Mvar rather than one optimized for Template:Mvar. Note that the roles of Template:Mvar and Template:Mvar can be reversed in some situations where that is easier to compute, such as with the expectation–maximization algorithm (EM) and evidence lower bound (ELBO) computations.

EtymologyEdit

The relative entropy was introduced by Solomon Kullback and Richard Leibler in Template:Harvtxt as "the mean information for discrimination between <math>H_1</math> and <math>H_2</math> per observation from <math>\mu_1</math>",Template:Sfn where one is comparing two probability measures <math>\mu_1, \mu_2</math>, and <math>H_1, H_2</math> are the hypotheses that one is selecting from measure <math>\mu_1, \mu_2</math> (respectively). They denoted this by <math>I(1:2)</math>, and defined the "'divergence' between <math>\mu_1</math> and <math>\mu_2</math>" as the symmetrized quantity <math>J(1,2) = I(1:2) + I(2:1)</math>, which had already been defined and used by Harold Jeffreys in 1948.Template:Sfn In Template:Harvtxt, the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as a "directed divergences" between two distributions;Template:Sfn Kullback preferred the term discrimination information.<ref name="Kullback1987">Template:Cite journal</ref> The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality.Template:Sfn Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Template:Harvtxt. The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.

DefinitionEdit

For discrete probability distributions Template:Mvar and Template:Mvar defined on the same sample space, Template:Nowrap the relative entropy from Template:Mvar to Template:Mvar is defined<ref name=MacKey2003>Template:Cite book</ref> to be

<math display="block"> D_\text{KL}(P \parallel Q) = \sum_{ x \in \mathcal{X} } P(x) \, \log\frac{ P(x) }{ Q(x) } \, ,</math>

which is equivalent to

<math display="block"> D_\text{KL}(P \parallel Q) = -\sum_{ x \in \mathcal{X} } P(x) \, \log\frac{ Q(x) }{P(x)} \,.</math>

In other words, it is the expectation of the logarithmic difference between the probabilities Template:Mvar and Template:Mvar, where the expectation is taken using the probabilities Template:Mvar.

Relative entropy is only defined in this way if, for all Template:Mvar, <math> Q(x) = 0 </math> implies <math> P(x) = 0 </math> (absolute continuity). Otherwise, it is often defined as Template:Nobr but the value <math>\ +\infty\ </math> is possible even if <math> Q(x) \ne 0 </math> everywhere,<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref><ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> provided that <math> \mathcal{X} </math> is infinite in extent. Analogous comments apply to the continuous and general measure cases defined below.

Whenever <math> P(x) </math> is zero the contribution of the corresponding term is interpreted as zero because

For distributions Template:Mvar and Template:Mvar of a continuous random variable, relative entropy is defined to be the integral<ref>Template:Cite book</ref>

<math display="block"> D_\text{KL}(P \parallel Q) = \int_{-\infty}^\infty p(x) \, \log\frac{p(x)}{q(x)} \, dx \, ,</math>

where Template:Mvar and Template:Mvar denote the probability densities of Template:Mvar and Template:Mvar.

More generally, if Template:Mvar and Template:Mvar are probability measures on a measurable space <math>\mathcal{X}\, ,</math> and Template:Mvar is absolutely continuous with respect to Template:Mvar, then the relative entropy from Template:Mvar to Template:Mvar is defined as

<math display="block">D_\text{KL}(P \parallel Q) = \int_{ x \in \mathcal{X} } \log \frac{P(dx)}{Q(dx)} \, P(dx)\, ,</math>

where <math>\frac{ P(dx) }{Q(dx) }</math> is the Radon–Nikodym derivative of Template:Mvar with respect to Template:Mvar, i.e. the unique Template:Mvar almost everywhere defined function Template:Mvar on <math> \mathcal{X} </math> such that <math> P(dx) = r(x) Q(dx) </math> which exists because Template:Mvar is absolutely continuous with respect to Template:Mvar. Also we assume the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as

<math display="block"> D_\text{KL}(P \parallel Q) = \int_{ x \in \mathcal{X} } \frac{P(dx)}{Q(dx)}\ \log\frac{P(dx)}{Q(dx)} \ Q(dx) \, ,</math>

which is the entropy of Template:Mvar relative to Template:Mvar. Continuing in this case, if <math>\mu</math> is any measure on <math>\mathcal{X}</math> for which densities Template:Mvar and Template:Mvar with <math> P(dx) = p(x) \mu(dx) </math> and <math> Q(dx) = q(x) \mu(dx) </math> exist (meaning that Template:Mvar and Template:Mvar are both absolutely continuous with respect to Template:Nowrap then the relative entropy from Template:Mvar to Template:Mvar is given as

<math display="block"> D_\text{KL}(P \parallel Q) = \int_{x \in \mathcal{X}} p(x)\, \log\frac{p(x)}{q(x)} \ \mu(dx) \,.</math>

Note that such a measure <math>\mu</math> for which densities can be defined always exists, since one can take <math display="inline"> \mu = \frac{1}{2} \left( P + Q \right) </math> although in practice it will usually be one that applies in the context like counting measure for discrete distributions, or Lebesgue measure or a convenient variant thereof like Gaussian measure or the uniform measure on the sphere, Haar measure on a Lie group etc. for continuous distributions. The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits, or to base Template:Mvar if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.

Various conventions exist for referring to <math> D_\text{KL}(P \parallel Q) </math> in words. Often it is referred to as the divergence between Template:Mvar and Template:Mvar, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of Template:Mvar from Template:Mvar or as the divergence from Template:Mvar to Template:Mvar. This reflects the asymmetry in Bayesian inference, which starts from a prior Template:Mvar and updates to the posterior Template:Mvar. Another common way to refer to <math> D_\text{KL}(P \parallel Q) </math> is as the relative entropy of Template:Mvar with respect to Template:Mvar or the information gain from Template:Mvar over Template:Mvar.

Basic exampleEdit

KullbackTemplate:Sfn gives the following example (Table 2.1, Example 2.1). Let Template:Mvar and Template:Mvar be the distributions shown in the table and figure. Template:Mvar is the distribution on the left side of the figure, a binomial distribution with <math>N = 2</math> and <math>p = 0.4</math>. Template:Mvar is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes Template:Math (i.e. <math>\mathcal{X}=\{0,1,2\}</math>), each with probability <math>p = 1/3</math>.

File:Kullback–Leibler distributions example 1.svg

Two distributions to illustrate relative entropy

Template:Diagonal split header	0	1	2
<math>P(x)</math>	Template:Sfrac	Template:Sfrac	Template:Sfrac
<math>Q(x)</math>	Template:Sfrac	Template:Sfrac	Template:Sfrac

Relative entropies <math>D_\text{KL}(P \parallel Q)</math> and <math>D_\text{KL}(Q \parallel P)</math> are calculated as follows. This example uses the natural log with base [[E (mathematical constant)|Template:Mvar]], designated Template:Math to get results in nats (see units of information):

<math display="block">\begin{align} D_\text{KL}(P \parallel Q) &= \sum_{x\in\mathcal{X}} P(x) \, \ln\frac{P(x)}{Q(x)} \\

&= \frac{9}{25} \ln\frac{9/25}{1/3}
 + \frac{12}{25} \ln\frac{12/25}{1/3}
 + \frac{4}{25}  \ln\frac{4/25}{1/3} \\

& = \frac{1}{25} \left(32 \ln 2 + 55 \ln 3 - 50 \ln 5 \right) \\ &\approx 0.0852996, \end{align}</math>

<math display="block">\begin{align} D_\text{KL}(Q \parallel P) &= \sum_{x\in\mathcal{X}} Q(x) \, \ln\frac{Q(x)}{P(x)} \\

&= \frac{1}{3} \, \ln\frac{1/3}{9/25}
 + \frac{1}{3} \, \ln\frac{1/3}{12/25}
 + \frac{1}{3} \, \ln\frac{1/3}{4/25} \\
&= \frac{1}{3} \left(-4 \ln 2 - 6 \ln 3 + 6 \ln 5 \right) \\
&\approx 0.097455.

\end{align}</math>

InterpretationsEdit

StatisticsEdit

In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions Template:Mvar and Template:Mvar based on an observation Template:Mvar (drawn from one of them) is through the log of the ratio of their likelihoods: <math>\log P(Y) - \log Q(Y)</math>. The KL divergence is the expected value of this statistic if Template:Mvar is actually drawn from Template:Mvar. Kullback motivated the statistic as an expected log likelihood ratio.Template:Sfn

CodingEdit

In the context of coding theory, <math>D_\text{KL}(P \parallel Q)</math> can be constructed by measuring the expected number of extra bits required to code samples from Template:Mvar using a code optimized for Template:Mvar rather than the code optimized for Template:Mvar.

InferenceEdit

In the context of machine learning, <math>D_\text{KL}(P \parallel Q)</math> is often called the information gain achieved if Template:Mvar would be used instead of Template:Mvar which is currently used. By analogy with information theory, it is called the relative entropy of Template:Mvar with respect to Template:Mvar.

Expressed in the language of Bayesian inference, <math>D_\text{KL}(P \parallel Q)</math> is a measure of the information gained by revising one's beliefs from the prior probability distribution Template:Mvar to the posterior probability distribution Template:Mvar. In other words, it is the amount of information lost when Template:Mvar is used to approximate Template:Mvar.<ref>Template:Cite book</ref>

Information geometryEdit

In applications, Template:Mvar typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Template:Mvar typically represents a theory, model, description, or approximation of Template:Mvar. In order to find a distribution Template:Mvar that is closest to Template:Mvar, we can minimize the KL divergence and compute an information projection.

While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence.Template:Sfn While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general <math>D_\text{KL}(P \parallel Q)</math> does not equal <math>D_\text{KL}(Q \parallel P)</math>, and the asymmetry is an important part of the geometry.Template:Sfn The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric; see Template:Slink. Fisher information metric on the certain probability distribution let determine the natural gradient for information-geometric optimization algorithms.<ref>Template:Cite journal</ref> Its quantum version is Fubini-study metric.<ref>Template:Cite journal</ref> Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.Template:Sfn

The relative entropy is the Bregman divergence generated by the negative entropy, but it is also of the form of an [[f-divergence|Template:Mvar-divergence]]. For probabilities over a finite alphabet, it is unique in being a member of both of these classes of statistical divergences. The application of Bregman divergence can be found in mirror descent.<ref>Template:Cite journal</ref>

Finance (game theory)Edit

Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes (e.g. a “horse race” in which the official odds add up to one). The rate of return expected by such an investor is equal to the relative entropy between the investor's believed probabilities and the official odds.<ref>Template:Cite journal</ref> This is a special case of a much more general connection between financial returns and divergence measures.<ref>Template:Cite journal</ref>

Financial risks are connected to <math>D_ \text{KL} </math> via information geometry.<ref>Template:Cite journal</ref> Investors' views, the prevailing market view, and risky scenarios form triangles on the relevant manifold of probability distributions. The shape of the triangles determines key financial risks (both qualitatively and quantitatively). For instance, obtuse triangles in which investors' views and risk scenarios appear on “opposite sides” relative to the market describe negative risks, acute triangles describe positive exposure, and the right-angled situation in the middle corresponds to zero risk. Extending this concept, relative entropy can be hypothetically utilised to identify the behaviour of informed investors, if one takes this to be represented by the magnitude and deviations away from the prior expectations of fund flows, for example.<ref>Template:Cite journal</ref>

MotivationEdit

File:KL-Gauss-Example.png

Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible.

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value <math>x_i</math> out of a set of possibilities Template:Mvar can be seen as representing an implicit probability distribution <math>q(x_i)=2^{-\ell_i}</math> over Template:Mvar, where <math>\ell_i</math> is the length of the code for <math>x_i</math> in bits. Therefore, relative entropy can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Template:Mvar is used, compared to using a code based on the true distribution Template:Mvar: it is the excess entropy.

<math display="block">\begin{align} D_\text{KL}(P\parallel Q) &= \sum_{x\in\mathcal{X}} p(x) \log \frac{1}{q(x)} - \sum_{x\in\mathcal{X}} p(x) \log \frac{1}{p(x)} \\[5pt]

&= \Eta(P, Q) - \Eta(P)

\end{align}</math>

where <math>\Eta(P,Q)</math> is the cross entropy of Template:Mvar relative to Template:Mvar and <math>\Eta(P)</math> is the entropy of Template:Mvar (which is the same as the cross-entropy of P with itself).

The relative entropy <math>D_\text{KL}(P \parallel Q)</math> can be thought of geometrically as a statistical distance, a measure of how far the distribution Template:Mvar is from the distribution Template:Mvar. Geometrically it is a divergence: an asymmetric, generalized form of squared distance. The cross-entropy <math>H(P,Q)</math> is itself such a measurement (formally a loss function), but it cannot be thought of as a distance, since <math>H(P,P)=:H(P)</math> is not zero. This can be fixed by subtracting <math>H(P)</math> to make <math>D_\text{KL}(P \parallel Q)</math> agree more closely with our notion of distance, as the excess loss. The resulting function is asymmetric, and while this can be symmetrized (see Template:Slink), the asymmetric form is more useful. See Template:Slink for more on the geometric interpretation.

Relative entropy relates to "rate function" in the theory of large deviations.<ref name="Sanov">Template:Cite journal</ref><ref name="Novak">Novak S.Y. (2011), Extreme Value Methods with Applications to Finance ch. 14.5 (Chapman & Hall). Template:Isbn.</ref>

Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.<ref>Template:Cite book</ref> Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.

PropertiesEdit

Relative entropy is always non-negative, <math display="block">D_\text{KL}(P \parallel Q) \geq 0,</math> a result known as Gibbs' inequality, with <math>D_\text{KL}(P\parallel Q)</math> equals zero if and only if <math>P = Q</math> as measures.

In particular, if <math>P(dx) = p(x) \mu(dx)</math> and <math>Q(dx) = q(x)\mu(dx)</math>, then <math>p(x) = q(x)</math> <math>\mu</math>-almost everywhere. The entropy <math>\Eta(P)</math> thus sets a minimum value for the cross-entropy <math>\Eta(P, Q)</math>, the expected number of bits required when using a code based on Template:Mvar rather than Template:Mvar; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value Template:Mvar drawn from Template:Mvar, if a code is used corresponding to the probability distribution Template:Mvar, rather than the "true" distribution Template:Mvar.

No upper-bound exists for the general case. However, it is shown that if Template:Mvar and Template:Mvar are two discrete probability distributions built by distributing the same discrete quantity, then the maximum value of <math>D_\text{KL}(P\parallel Q)</math> can be calculated.<ref name="Bonnici2020">Template:Cite arXiv</ref>
Relative entropy remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable Template:Mvar to variable <math>y(x)</math>, then, since <math>P(dx) = p(x) \, dx = \tilde{p}(y) \, dy = \tilde{p}(y(x)) \left|\tfrac{dy}{dx}(x)\right| \,dx </math> and <math>Q(dx)= q(x)\, dx = \tilde{q}(y) \, dy = \tilde{q}(y) \left|\tfrac{dy}{dx}(x)\right| dx</math> where <math>\left|\tfrac{dy}{dx}(x)\right|</math> is the absolute value of the derivative or more generally of the Jacobian, the relative entropy may be rewritten: <math display="block">\begin{align}

D_\text{KL}(P \parallel Q)
&= \int_{x_a}^{x_b} p(x) \, \log\frac{p(x)}{q(x)} \, dx \\[6pt]
&= \int_{x_a}^{x_b} \tilde{p}(y(x)) \left|\frac{dy}{dx}\right| \log\frac{\tilde{p}(y(x))\, \left|\frac{dy}{dx}\right|}{\tilde{q}(y(x))\, \left|\frac{dy}{dx}\right|}\, dx \\
&= \int_{y_a}^{y_b} \tilde{p}(y) \, \log\frac{\tilde{p}(y)}{\tilde{q}(y)} \, dy

\end{align}</math> where <math>y_a = y(x_a)</math> and <math>y_b = y(x_b)</math>. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the relative entropy produces a dimensionally consistent quantity, since if Template:Mvar is a dimensioned variable, <math>p(x)</math> and <math>q(x)</math> are also dimensioned, since e.g. <math>P(dx) = p(x) \, dx</math> is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory<ref name="VerduLecture">See the section "differential entropy – 4" in Relative Entropy video lecture by Sergio Verdú NIPS 2009</ref> (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.

Relative entropy is additive for independent distributions in much the same way as Shannon entropy. If <math>P_1, P_2</math> are independent distributions, and <math>P(dx, dy) = P_1(dx)P_2(dy)</math>, and likewise <math>Q(dx, dy) = Q_1(dx)Q_2(dy)</math> for independent distributions <math>Q_1, Q_2</math> then <math display="block"> D_\text{KL}(P \parallel Q) = D_\text{KL}(P_1 \parallel Q_1) + D_\text{KL}(P_2 \parallel Q_2).</math>
Relative entropy <math> D_\text{KL}(P \parallel Q)</math> is convex in the pair of probability measures <math>(P,Q)</math>, i.e. if <math>(P_1,Q_1)</math> and <math>(P_2,Q_2)</math> are two pairs of probability measures then <math display="block">D_\text{KL}(\lambda P_1 + (1 - \lambda) P_2 \parallel \lambda Q_1 + (1 - \lambda) Q_2) \le

\lambda D_\text{KL}(P_1 \parallel Q_1) + (1 - \lambda)D_\text{KL}(P_2 \parallel Q_2) \text{ for } 0 \le \lambda \le 1.</math>

<math>D_\text{KL}(P \parallel Q)</math> may be Taylor expanded about its minimum (i.e. <math>P=Q</math>) as <math display="block">D_\text{KL}(P \parallel Q) = \sum_{n=2}^\infty \frac {1}{n(n-1)} \sum_{x \in \mathcal{X}} \frac{(Q(x) - P(x))^n}{Q(x)^{n-1}} </math> which converges if and only if <math>P \leq 2Q</math> almost surely w.r.t <math>Q</math>.

Template:Hidden begin Denote <math>f(\alpha) := D_\text{KL}((1-\alpha) Q + \alpha P \parallel Q)</math> and note that <math>D_\text{KL}(P \parallel Q) = f(1)</math>. The first derivative of <math>f</math> may be derived and evaluated as follows <math display="block"> \begin{align} f'(\alpha) &= \sum_{x \in \mathcal{X}} (P(x) - Q(x)) \left(\log\left(\frac{(1-\alpha) Q(x) + \alpha P(x)}{Q(x)}\right) + 1\right) \\ &= \sum_{x \in \mathcal{X}} (P(x) - Q(x)) \log\left(\frac{(1-\alpha) Q(x) + \alpha P(x)}{Q(x)}\right) \\ f'(0) &= 0 \end{align} </math> Further derivatives may be derived and evaluated as follows <math display="block"> \begin{align} f(\alpha) &= \sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^2}{(1-\alpha) Q(x) + \alpha P(x)} \\ f(0) &= \sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^2}{Q(x)} \\ f^{(n)}(\alpha) &= (-1)^n (n-2)!\sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^n}{\left((1-\alpha) Q(x) + \alpha P(x)\right)^{n-1}} \\ f^{(n)}(0) &= (-1)^n (n-2)!\sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^n}{Q(x)^{n-1}} \end{align} </math> Hence solving for <math>D_\text{KL}(P \parallel Q)</math> via the Taylor expansion of <math>f</math> about <math>0</math> evaluated at <math>\alpha=1</math> yields <math display="block"> \begin{align} D_\text{KL}(P \parallel Q) &= \sum_{n=0}^\infty \frac{f^{(n)}(0)}{n!} \\ &= \sum_{n=2}^\infty \frac{1}{n(n-1)} \sum_{x \in \mathcal{X}} \frac{(Q(x) - P(x))^n}{Q(x)^{n-1}} \end{align} </math> <math>P \leq 2Q</math> a.s. is a sufficient condition for convergence of the series by the following absolute convergence argument <math display="block"> \begin{align} \sum_{n=2}^\infty \left\vert\frac{1}{n(n-1)} \sum_{x \in \mathcal{X}} \frac{(Q(x) - P(x))^n}{Q(x)^{n-1}}\right\vert &= \sum_{n=2}^\infty \frac{1}{n(n-1)} \sum_{x \in \mathcal{X}} \left\vert Q(x) - P(x) \right\vert \left\vert 1 - \frac{P(x)}{Q(x)} \right\vert^{n-1} \\ &\leq \sum_{n=2}^\infty \frac{1}{n(n-1)} \sum_{x \in \mathcal{X}} \left\vert Q(x) - P(x) \right\vert \\ &\leq \sum_{n=2}^\infty \frac{1}{n(n-1)} \\ &= 1 \end{align} </math> <math>P \leq 2Q</math> a.s. is also a necessary condition for convergence of the series by the following proof by contradiction. Assume that <math>P > 2Q</math> with measure strictly greater than <math>0</math>. It then follows that there must exist some values <math>\varepsilon > 0</math>, <math>\rho > 0</math>, and <math>U < \infty</math> such that <math>P \geq 2Q + \varepsilon</math> and <math>Q \leq U</math> with measure <math>\rho</math>. The previous proof of sufficiency demonstrated that the measure <math>1-\rho</math> component of the series where <math>P \leq 2Q</math> is bounded, so we need only concern ourselves with the behavior of the measure <math>\rho</math> component of the series where <math>P \geq 2Q + \varepsilon</math>. The absolute value of the <math>n</math>th term of this component of the series is then lower bounded by <math>\frac1{n(n-1)} \rho \left(1 + \frac{\varepsilon}{U}\right)^n</math>, which is unbounded as <math>n \to \infty</math>, so the series diverges. Template:Hidden end

Duality formula for variational inferenceEdit

The following result, due to Donsker and Varadhan,<ref>Template:Cite journal</ref> is known as Donsker and Varadhan's variational formula.

Template:Math theorem

Template:Math proof

ExamplesEdit

Multivariate normal distributionsEdit

Suppose that we have two multivariate normal distributions, with means <math>\mu_0, \mu_1</math> and with (non-singular) covariance matrices <math>\Sigma_0, \Sigma_1.</math> If the two distributions have the same dimension, Template:Mvar, then the relative entropy between the distributions is as follows:<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

 D_\text{KL}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) =
 \frac{1}{2}\left[ \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k +
   \left(\mu_1 - \mu_0\right)^\mathsf{T} \Sigma_1^{-1}\left(\mu_1 - \mu_0\right) +
   \ln \frac{\det\Sigma_1}{\det\Sigma_0}
 \right].

</math>

The logarithm in the last term must be taken to base [[e (mathematical constant)|Template:Mvar]] since all terms apart from the last are base-Template:Mvar logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by <math>\ln(2)</math> yields the divergence in bits.

In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions <math>L_0, L_1</math> such that <math>\Sigma_0 = L_0 L_0^T</math> and <math>\Sigma_1 = L_1 L_1^T</math>. Then with Template:Mvar and Template:Mvar solutions to the triangular linear systems <math>L_1 M = L_0</math>, and <math>L_1 y = \mu_1 - \mu_0</math>,

 D_\text{KL}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) =
 \frac{1}{2}\left(
   \sum_{i,j=1}^k {\left(M_{ij}\right)}^2 - k +
   |y|^2 +
   2\sum_{i=1}^k \ln \frac{(L_1)_{ii}}{(L_0)_{ii}}
 \right).

</math>

A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal, and a standard normal distribution (with zero mean and unit variance):

 D_\text{KL}\left(
   \mathcal{N}\left(\left(\mu_1, \ldots, \mu_k\right)^\mathsf{T}, \operatorname{diag} \left(\sigma_1^2, \ldots, \sigma_k^2\right)\right) \parallel
   \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)
 \right) =
 \frac{1}{2} \sum_{i=1}^k \left[\sigma_i^2 + \mu_i^2 - 1 - \ln\left(\sigma_i^2\right)\right].

</math>

For two univariate normal distributions Template:Mvar and Template:Mvar the above simplifies to<ref>Template:Cite journal</ref> <math display="block">

D_\text{KL}\left(\mathcal{p} \parallel \mathcal{q}\right) = \log \frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + {\left(\mu_0 - \mu_1\right)}^2}{2\sigma_1^2} - \frac{1}{2}

</math>

In the case of co-centered normal distributions with <math>k=\sigma_1/\sigma_0</math>, this simplifies<ref name="auto">Template:Cite book</ref> to:

D_\text{KL}\left(\mathcal{p} \parallel \mathcal{q}\right) = \log_2 k + (k^{-2}-1)/2/\ln(2) \mathrm{bits}

</math>

Uniform distributionsEdit

Consider two uniform distributions, with the support of <math>p=[A,B]</math> enclosed within <math>q=[C,D]</math> (<math>C\le A<B\le D</math>). Then the information gain is:

<math display="block"> D_\text{KL}\left(\mathcal{p} \parallel \mathcal{q}\right) = \log \frac{D-C}{B-A}</math>

Intuitively,<ref name="auto"/> the information gain to a Template:Mvar times narrower uniform distribution contains <math>\log_2 k</math> bits. This connects with the use of bits in computing, where <math>\log_2k</math> bits would be needed to identify one element of a Template:Mvar long stream.

Exponential familyEdit

The exponential family of distribution is given by

<math display="block">p_X(x|\theta) = h(x) \exp\left(\theta^\mathsf{T} T(x) - A(\theta)\right)</math>

where <math>h(x)</math> is reference measure, <math>T(x)</math> is sufficient statistics, <math>\theta</math> is canonical natural parameters, and <math>A(\theta)</math> is the log-partition function.

The KL divergence between two distributions <math>p(x|\theta_1)</math> and <math>p(x|\theta_2)</math> is given by<ref>Template:Cite arXiv</ref>

<math display="block">D_{\text{KL}}(\theta_1 \parallel \theta_2) = {\left(\theta_1 - \theta_2\right)}^\mathsf{T} \mu_1 - A(\theta_1) + A(\theta_2) </math>

where <math>\mu_1 = E_{\theta_1}[T(X)] = \nabla A(\theta_1)</math> is the mean parameter of <math>p(x|\theta_1)</math>.

For example, for the Poisson distribution with mean <math>\lambda</math>, the sufficient statistics <math>T(x) = x</math>, the natural parameter <math>\theta = \log \lambda</math>, and log partition function <math>A(\theta) = e^\theta</math>. As such, the divergence between two Poisson distributions with means <math>\lambda_1</math> and <math>\lambda_2</math> is

<math display="block">D_{\text{KL}}(\lambda_1 \parallel \lambda_2) = \lambda_1 \log \frac{\lambda_1}{\lambda_2} - \lambda_1 + \lambda_2.</math>

As another example, for a normal distribution with unit variance <math>N(\mu, 1)</math>, the sufficient statistics <math>T(x) = x</math>, the natural parameter <math>\theta = \mu</math>, and log partition function <math>A(\theta)=\mu^2/2</math>. Thus, the divergence between two normal distributions <math>N(\mu_1, 1)</math> and <math>N(\mu_2, 1)</math> is

<math display="block">D_{\text{KL}}(\mu_1 \parallel \mu_2) = \left(\mu_1 - \mu_2\right) \mu_1 - \frac{\mu_1^2}{2} + \frac{\mu_2^2}{2} = \frac{{\left(\mu_2 - \mu_1\right)}^2}{2}.</math>

As final example, the divergence between a normal distribution with unit variance <math>N(\mu, 1)</math> and a Poisson distribution with mean <math>\lambda</math> is

<math display="block">D_{\text{KL}}(\mu \parallel \lambda) = (\mu - \log \lambda) \mu - \frac{\mu^2}{2} + \lambda.</math>

Relation to metricsEdit

While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence.Template:Sfn While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general <math>D_\text{KL}(P \parallel Q)</math> does not equal <math>D_\text{KL}(Q \parallel P)</math>, and while this can be symmetrized (see Template:Slink), the asymmetry is an important part of the geometry.Template:Sfn

It generates a topology on the space of probability distributions. More concretely, if <math>\{P_1,P_2,\ldots\}</math> is a sequence of distributions such that

<math display="block">\lim_{n \to \infty} D_\text{KL}(P_n\parallel Q) = 0,</math>

then it is said that

<math display="block">P_n \xrightarrow{D} \, Q .</math>

Pinsker's inequality entails that

<math display="block">P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P ,</math>

where the latter stands for the usual convergence in total variation.

Fisher information metricEdit

Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions Template:Mvar and Template:Mvar are both parameterized by some (possibly multi-dimensional) parameter <math>\theta</math>. Consider then two close by values of <math>P = P(\theta)</math> and <math>Q = P(\theta_0)</math> so that the parameter <math>\theta</math> differs by only a small amount from the parameter value <math>\theta_0</math>. Specifically, up to first order one has (using the Einstein summation convention) <math display="block">P(\theta) = P(\theta_0) + \Delta\theta_j \, P_j(\theta_0) + \cdots</math>

with <math>\Delta\theta_j = (\theta - \theta_0)_j</math> a small change of <math>\theta</math> in the Template:Mvar direction, and <math>P_j\left(\theta_0\right) = \frac{\partial P}{\partial \theta_j}(\theta_0)</math> the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for <math>P = Q</math>, i.e. <math> \theta = \theta_0 </math>, it changes only to second order in the small parameters <math>\Delta\theta_j</math>. More formally, as for any minimum, the first derivatives of the divergence vanish

<math display="block">\left.\frac{\partial}{\partial\theta_j}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0,</math>

and by the Taylor expansion one has up to second order

<math display="block">D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \, \Delta\theta_j \, \Delta\theta_k \, g_{jk}(\theta_0) + \cdots</math>

where the Hessian matrix of the divergence

<math display="block">g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j\, \partial\theta_k} \right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0))</math>

must be positive semidefinite. Letting <math>\theta_0</math> vary (and dropping the subindex 0) the Hessian <math>g_{jk}(\theta)</math> defines a (possibly degenerate) Riemannian metric on the Template:Mvar parameter space, called the Fisher information metric.

Fisher information metric theoremEdit

When <math>p_{(x, \rho)}</math> satisfies the following regularity conditions:

<math display="block">\frac{\partial \log(p)}{\partial \rho}, \frac{\partial^2 \log(p)}{\partial \rho^2}, \frac{\partial^3 \log(p)}{\partial \rho^3}</math> exist, <math display="block">\begin{align}

\left|\frac{\partial p}{\partial \rho}\right| &< F(x): \int_{x=0}^\infty F(x)\,dx < \infty, \\
\left|\frac{\partial^2 p}{\partial \rho^2}\right| &< G(x): \int_{x=0}^\infty G(x)\,dx < \infty \\
\left|\frac{\partial^3 \log(p)}{\partial \rho^3}\right| &< H(x): \int_{x=0}^\infty p(x, 0)H(x)\,dx < \xi < \infty

\end{align}</math>

where Template:Mvar is independent of Template:Mvar <math display="block">

\left.\int_{x=0}^\infty \frac{\partial p(x, \rho)}{\partial \rho}\right|_{\rho=0}\, dx =
\left.\int_{x=0}^\infty \frac{\partial^2 p(x, \rho)}{\partial \rho^2}\right|_{\rho=0}\, dx = 0

</math>

then: <math display="block">\mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c\rho^2}{2} + \mathcal{O}\left(\rho^3\right) \text{ as } \rho \to 0.</math>

Variation of informationEdit

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

MAUVE MetricEdit

MAUVE is a measure of the statistical gap between two text distributions, such as the difference between text generated by a model and human-written text. This measure is computed using Kullback–Leibler divergences between the two distributions in a quantized embedding space of a foundation model.

Relation to other quantities of information theoryEdit

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

Self-informationEdit

{{#invoke:Labelled list hatnote|labelledList|Main article|Main articles|Main page|Main pages}} The self-information, also known as the information content of a signal, random variable, or event is defined as the negative logarithm of the probability of the given outcome occurring.

When applied to a discrete random variable, the self-information can be represented asTemplate:Citation needed

<math display="block">\operatorname \operatorname{I}(m) = D_\text{KL}\left(\delta_\text{im} \parallel \{p_i\}\right),</math>

is the relative entropy of the probability distribution <math>P(i)</math> from a Kronecker delta representing certainty that <math>i = m</math> — i.e. the number of extra bits that must be transmitted to identify Template:Mvar if only the probability distribution <math>P(i)</math> is available to the receiver, not the fact that <math>i = m</math>.

Mutual informationEdit

The mutual information,

<math display="block">\begin{align}

\operatorname{I}(X; Y)
  &= D_\text{KL}(P(X, Y) \parallel P(X)P(Y)) \\[5pt]
  &= \operatorname{E}_X \{D_\text{KL}(P(Y \mid X) \parallel P(Y))\} \\[5pt]
  &= \operatorname{E}_Y \{D_\text{KL}(P(X \mid Y) \parallel P(X))\}

\end{align}</math>

is the relative entropy of the joint probability distribution <math>P(X,Y)</math> from the product <math>P(X)P(Y)</math> of the two marginal probability distributions — i.e. the expected number of extra bits that must be transmitted to identify Template:Mvar and Template:Mvar if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability <math>P(X,Y)</math> is known, it is the expected number of extra bits that must on average be sent to identify Template:Mvar if the value of Template:Mvar is not already known to the receiver.

Shannon entropyEdit

The Shannon entropy,

<math display="block">\begin{align}

\Eta(X) &= \operatorname{E} \left[\operatorname{I}_X(x)\right] \\
        &= \log N - D_\text{KL}{\left(p_X(x) \parallel P_U(X)\right)}

\end{align}</math>

is the number of bits which would have to be transmitted to identify Template:Mvar from Template:Mvar equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of Template:Mvar, <math>P_U(X)</math>, from the true distribution <math>P(X)</math> — i.e. less the expected number of bits saved, which would have had to be sent if the value of Template:Mvar were coded according to the uniform distribution <math>P_U(X)</math> rather than the true distribution <math>P(X)</math>. This definition of Shannon entropy forms the basis of E.T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points (as opposed to the usual differential entropy), which defines the continuous entropy as <math display="block">\lim_{N \to \infty} H_{N}(X) = \log N - \int p(x)\log\frac{p(x)}{m(x)}\,dx,</math> which is equivalent to: <math display="block">\log(N) - D_\text{KL}(p(x)||m(x))</math>

Conditional entropyEdit

The conditional entropy Template:R,

<math display="block">\begin{align}

 \Eta(X \mid Y)
   &= \log N - D_\text{KL}(P(X, Y) \parallel P_U(X) P(Y)) \\[5pt]
   &= \log N - D_\text{KL}(P(X, Y) \parallel P(X) P(Y)) - D_\text{KL}(P(X) \parallel P_U(X)) \\[5pt]
   &= \Eta(X) - \operatorname{I}(X; Y) \\[5pt]
   &= \log N - \operatorname{E}_Y \left[D_\text{KL}\left(P\left(X \mid Y\right) \parallel P_U(X)\right)\right]

\end{align}</math>

is the number of bits which would have to be transmitted to identify Template:Mvar from Template:Mvar equally likely possibilities, less the relative entropy of the product distribution <math>P_U(X) P(Y)</math> from the true joint distribution <math>P(X,Y)</math> — i.e. less the expected number of bits saved which would have had to be sent if the value of Template:Mvar were coded according to the uniform distribution <math>P_U(X)</math> rather than the conditional distribution <math>P(X|Y)</math> of Template:Mvar given Template:Mvar.

Cross entropyEdit

When we have a set of possible events, coming from the distribution Template:Mvar, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g.: the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution Template:Mvar in advance, we can devise an encoding that would be optimal (e.g.: using Huffman coding). Meaning the messages we encode will have the shortest length on average (assuming the encoded events are sampled from Template:Mvar), which will be equal to Shannon's Entropy of Template:Mvar (denoted as <math>\Eta(p)</math>). However, if we use a different probability distribution (Template:Mvar) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between Template:Mvar and Template:Mvar.

The cross entropy between two probability distributions (Template:Mvar and Template:Mvar) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution Template:Mvar, rather than the "true" distribution Template:Mvar. The cross entropy for two distributions Template:Mvar and Template:Mvar over the same probability space is thus defined as follows.

<math display="block">\Eta(p, q) = \operatorname{E}_p[-\log q] = \Eta(p) + D_\text{KL}(p \parallel q).</math>

For explicit derivation of this, see the Motivation section above.

Under this scenario, relative entropies (kl-divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond <math>\Eta(p)</math>) for encoding the events because of using Template:Mvar for constructing the encoding scheme instead of Template:Mvar.

Bayesian updatingEdit

In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: <math>p(x) \to p(x\mid I)</math>. If some new fact <math>Y = y</math> is discovered, it can be used to update the posterior distribution for Template:Mvar from <math>p(x\mid I)</math> to a new posterior distribution <math>p(x\mid y,I)</math> using Bayes' theorem:

This distribution has a new entropy:

which may be less than or greater than the original entropy <math>\Eta(p(x\mid I))</math>. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on <math>p(x\mid I)</math> instead of a new code based on <math>p(x\mid y, I)</math> would have added an expected number of bits:

<math display="block"> D_\text{KL}\big(p(x \mid y, I) \parallel p(x \mid I) \big) = \sum_x p(x \mid y, I) \log \frac{p(x \mid y, I)}{p(x \mid I)} </math>

to the message length. This therefore represents the amount of useful information, or information gain, about Template:Mvar, that has been learned by discovering <math>Y = y</math>.

If a further piece of data, <math>Y_2 = y_2</math>, subsequently comes in, the probability distribution for Template:Mvar can be updated further, to give a new best guess <math>p(x \mid y_1, y_2, I)</math>. If one reinvestigates the information gain for using <math>p(x \mid y_1,I)</math> rather than <math>p(x \mid I)</math>, it turns out that it may be either greater or less than previously estimated:

and so the combined information gain does not obey the triangle inequality:

<math display="block">D_\text{KL} \big( p(x \mid y_1, y_2, I) \parallel p(x \mid I) \big)</math> may be <, = or > than <math>D_\text{KL}\big( p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_\text{KL}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big)</math>

All one can say is that on average, averaging using <math>p(y_2 \mid y_1, x, I)</math>, the two sides will average out.

Bayesian experimental designEdit

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.<ref>Template:Cite journal</ref> When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.

Discrimination informationEdit

Relative entropy <math display="inline">D_\text{KL}\bigl(p(x \mid H_1) \parallel p(x \mid H_0)\bigr)</math> can also be interpreted as the expected discrimination information for <math>H_1</math> over <math>H_0</math>: the mean information per sample for discriminating in favor of a hypothesis <math>H_1</math> against a hypothesis <math>H_0</math>, when hypothesis <math>H_1</math> is true.<ref>Template:Cite book</ref> Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for <math>H_1</math> over <math>H_0</math> to be expected from each sample.

The expected weight of evidence for <math>H_1</math> over <math>H_0</math> is not the same as the information gain expected per sample about the probability distribution <math>p(H)</math> of the hypotheses,

<math display="block">D_\text{KL}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_\text{KL}(p(H \mid x) \parallel p(H \mid I)).</math>

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination informationEdit

The idea of relative entropy as discrimination information led Kullback to propose the Principle of Template:Visible anchor (MDI): given new facts, a new distribution Template:Mvar should be chosen which is as hard to discriminate from the original distribution <math>f_0</math> as possible; so that the new data produces as small an information gain <math>D_\text{KL}(f \parallel f_0)</math> as possible.

For example, if one had a prior distribution <math>p(x,a)</math> over Template:Mvar and Template:Mvar, and subsequently learnt the true distribution of Template:Mvar was <math>u(a)</math>, then the relative entropy between the new joint distribution for Template:Mvar and Template:Mvar, <math>q(x\mid a)u(a)</math>, and the earlier prior distribution would be:

<math display="block">D_\text{KL}(q(x \mid a)u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_\text{KL}(q(x \mid a) \parallel p(x \mid a))\right\} + D_\text{KL}(u(a) \parallel p(a)),</math>

i.e. the sum of the relative entropy of <math>p(a)</math> the prior distribution for Template:Mvar from the updated distribution <math>u(a)</math>, plus the expected value (using the probability distribution <math>u(a)</math>) of the relative entropy of the prior conditional distribution <math>p(x\mid a)</math> from the new conditional distribution <math>q(x\mid a)</math>. (Note that often the later expected value is called the conditional relative entropy (or conditional Kullback–Leibler divergence) and denoted by <math>D_\text{KL}(q(x\mid a) \parallel p(x\mid a))</math>Template:Sfn<ref name="CoverThomas">Template:Citation</ref>) This is minimized if <math>q(x\mid a)=p(x\mid a)</math> over the whole support of <math>u(a)</math>; and we note that this result incorporates Bayes' theorem, if the new distribution <math>u(a)</math> is in fact a δ function representing certainty that Template:Mvar has one particular value.

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from Template:Mvar to Template:Mvar with respect to Template:Mvar is equivalent to minimizing the cross-entropy of Template:Mvar and Template:Mvar, since

<math display="block">\Eta(p, m) = \Eta(p) + D_\text{KL}(p \parallel m),</math>

which is appropriate if one is trying to choose an adequate approximation to Template:Mvar. However, this is just as often not the task one is trying to achieve. Instead, just as often it is Template:Mvar that is some fixed prior reference measure, and Template:Mvar that one is attempting to optimise by minimising <math>D_\text{KL}(p \parallel m)</math> subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be <math>D_\text{KL}(p \parallel m)</math>, rather than <math>\Eta(p,m)</math> Template:Citation needed.

Relationship to available workEdit

File:ArgonKLdivergence.png

Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as <math>T_o</math> times the Kullback–Leibler divergence

Surprisals<ref>Template:Cite book</ref> add where probabilities multiply. The surprisal for an event of probability Template:Mvar is defined as <math>s = - k \ln p</math>. If Template:Mvar is <math>\left\{1, 1/\ln 2, 1.38 \times 10^{-23}\right\}</math> then surprisal is in <math>\{</math>nats, bits, or <math>J/K\}</math> so that, for instance, there are Template:Mvar bits of surprisal for landing all "heads" on a toss of Template:Mvar coins.

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal Template:Mvar (entropy) for a given set of control parameters (like pressure Template:Mvar or volume Template:Mvar). This constrained entropy maximization, both classically<ref>Template:Cite journal</ref> and quantum mechanically,<ref>Template:Cite journal</ref> minimizes Gibbs availability in entropy units<ref>Template:Cite book footnote page 52.</ref> <math>A \equiv -k \ln Z</math> where Template:Mvar is a constrained multiplicity or partition function.

When temperature Template:Mvar is fixed, free energy (<math>T \times A</math>) is also minimized. Thus if <math>T, V</math> and number of molecules Template:Mvar are constant, the Helmholtz free energy <math>F \equiv U - TS</math> (where Template:Mvar is energy and Template:Mvar is entropy) is minimized as a system "equilibrates." If Template:Mvar and Template:Mvar are held constant (say during processes in your body), the Gibbs free energy <math>G = U + PV - TS</math> is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature <math>T_o</math> and pressure <math>P_o</math> is <math>W = \Delta G = NkT_o\Theta(V/V_o)</math> where <math>V_o = NkT_o/P_o</math> and <math>\Theta(x) = x - 1 - \ln x \ge 0</math> (see also Gibbs inequality).

More generally<ref>Template:Cite journal</ref> the work available relative to some ambient is obtained by multiplying ambient temperature <math>T_o</math> by relative entropy or net surprisal <math>\Delta I \ge 0,</math> defined as the average value of <math>k\ln(p/p_o)</math> where <math>p_o</math> is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of <math>V_o</math> and <math>T_o</math> is thus <math>W = T_o \Delta I</math>, where relative entropy

<math display="block">\Delta I = N k \left[\Theta{\left(\frac{V}{V_o}\right)} + \frac{3}{2} \Theta{\left(\frac{T}{T_o}\right)}\right].</math>

The resulting contours of constant relative entropy, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here.<ref>Template:Cite journal</ref> Thus relative entropy measures thermodynamic availability in bits.

Quantum information theoryEdit

For density matrices Template:Mvar and Template:Mvar on a Hilbert space, the quantum relative entropy from Template:Mvar to Template:Mvar is defined to be

<math display="block">D_\text{KL}(P \parallel Q) = \operatorname{Tr}(P(\log P - \log Q)).</math>

In quantum information science the minimum of <math> D_\text{KL}(P\parallel Q)</math> over all separable states Template:Mvar can also be used as a measure of entanglement in the state Template:Mvar.

Relationship between models and realityEdit

Template:See also Template:Further

Just as relative entropy of "actual from ambient" measures thermodynamic availability, relative entropy of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case relative entropy describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.

Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via Akaike information criterion are particularly well described in papers<ref>Template:Cite journal</ref> and a book<ref>Template:Cite book</ref> by Burnham and Anderson. In a nutshell the relative entropy of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation) . Estimates of such divergence for models that share the same additive term can in turn be used to select among models.

When trying to fit parametrized models to data there are various estimators which attempt to minimize relative entropy, such as maximum likelihood and maximum spacing estimators.Template:Citation needed

Symmetrised divergenceEdit

Template:Harvtxt also considered the symmetrized function:Template:Sfn

<math display="block">D_\text{KL}(P \parallel Q) + D_\text{KL}(Q \parallel P)</math>

which they referred to as the "divergence", though today the "KL divergence" refers to the asymmetric function (see Template:Slink for the evolution of the term). This function is symmetric and nonnegative, and had already been defined and used by Harold Jeffreys in 1948;Template:Sfn it is accordingly called the Jeffreys divergence.

This quantity has sometimes been used for feature selection in classification problems, where Template:Mvar and Template:Mvar are the conditional pdfs of a feature under two different classes. In the Banking and Finance industries, this quantity is referred to as Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.

An alternative is given via the <math>\lambda</math>-divergence,

<math display="block">D_\lambda(P \parallel Q) = \lambda D_\text{KL}(P \parallel \lambda P + (1 - \lambda)Q) + (1 - \lambda) D_\text{KL}(Q \parallel \lambda P + (1 - \lambda)Q),</math>

which can be interpreted as the expected information gain about Template:Mvar from discovering which probability distribution Template:Mvar is drawn from, Template:Mvar or Template:Mvar, if they currently have probabilities <math>\lambda</math> and <math>1-\lambda</math> respectively.Template:Clarify Template:Citation needed

The value <math>\lambda = 0.5</math> gives the Jensen–Shannon divergence, defined by

<math display="block">D_\text{JS} = \tfrac{1}{2} D_\text{KL} (P \parallel M) + \tfrac{1}{2} D_\text{KL}(Q \parallel M)</math>

where Template:Mvar is the average of the two distributions,

<math display="block">M = \tfrac{1}{2} \left(P + Q\right).</math>

We can also interpret <math>D_\text{JS}</math> as the capacity of a noisy information channel with two inputs giving the output distributions Template:Mvar and Template:Mvar. The Jensen–Shannon divergence, like all Template:Mvar-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).

Furthermore, the Jensen–Shannon divergence can be generalized using abstract statistical M-mixtures relying on an abstract mean M.<ref name="Nielsen2019">Template:Cite journal</ref><ref name="Nielsen2020">Template:Cite journal</ref>

Relationship to other probability-distance measuresEdit

There are many other important measures of probability distance. Some of these are particularly connected with relative entropy. For example:

The total-variation distance, <math>\delta(p,q)</math>. This is connected to the divergence through Pinsker's inequality: <math display="block">\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_\text{KL}(P \parallel Q)}.</math> Pinsker's inequality is vacuous for any distributions where <math>D_{\mathrm{KL}}(P\parallel Q)>2</math>, since the total variation distance is at most Template:Mvar. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber<ref>Template:Citation Lemma 2.1</ref> (see, also, Tsybakov<ref>Template:Cite book Equation 2.25.</ref>): <math display="block">\delta(P,Q) \le \sqrt{1-e^{ -D_{\mathrm{KL}}(P\parallel Q) }}.</math>
The family of Rényi divergences generalize relative entropy. Depending on the value of a certain parameter, <math>\alpha</math>, various inequalities may be deduced.

Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.<ref name="earth">Template:Cite journal</ref>

Data differencingEdit

{{#invoke:Labelled list hatnote|labelledList|Main article|Main articles|Main page|Main pages}} Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).

ReferencesEdit

Template:Reflist Template:Refbegin

Template:Cite book
Template:Citation. Republished by Dover Publications in 1968; reprinted in 1978: Template:Isbn.
Template:Cite book

Template:Refend

Kullback–Leibler divergence

Introduction and contextEdit

EtymologyEdit

DefinitionEdit

Basic exampleEdit

InterpretationsEdit

StatisticsEdit

CodingEdit

InferenceEdit

Information geometryEdit

Finance (game theory)Edit

MotivationEdit

PropertiesEdit

Duality formula for variational inferenceEdit

ExamplesEdit

Multivariate normal distributionsEdit

Uniform distributionsEdit

Exponential familyEdit

Relation to metricsEdit

Fisher information metricEdit

Fisher information metric theoremEdit

Variation of informationEdit

MAUVE MetricEdit

Relation to other quantities of information theoryEdit

Self-informationEdit

Mutual informationEdit

Shannon entropyEdit

Conditional entropyEdit

Cross entropyEdit

Bayesian updatingEdit

Bayesian experimental designEdit

Discrimination informationEdit

Principle of minimum discrimination informationEdit

Relationship to available workEdit

Quantum information theoryEdit

Relationship between models and realityEdit

Symmetrised divergenceEdit

Relationship to other probability-distance measuresEdit

Data differencingEdit

See alsoEdit

ReferencesEdit

External linksEdit