===Quantities of information (entropy)===
Given a beta distributed random variable, ''X'' ~ Beta(''α'', ''β''), the [[information entropy|differential entropy]] of ''X'' (measured in [[Nat (unit)|nats]]) is the expected value of the negative of the logarithm of the [[probability density function]]:<ref>{{cite journal |first1=A. C. G. |last1=Verdugo Lazo |first2=P. N. |last2=Rathie |title=On the entropy of continuous probability distributions |journal=IEEE Trans. Inf. Theory |volume=24 |issue=1 |pages=120–122 |year=1978 |doi=10.1109/TIT.1978.1055832 }}</ref>

:<math>\begin{align}
h(X) &= \operatorname{E}[-\ln(f(X;\alpha,\beta))] \\[4pt]
&= \int_0^1 -f(x;\alpha,\beta)\ln(f(x;\alpha,\beta)) \, dx \\[4pt]
&= \ln(\Beta(\alpha,\beta))-(\alpha-1)\psi(\alpha)-(\beta-1)\psi(\beta)+(\alpha+\beta-2) \psi(\alpha+\beta)
\end{align}</math>

where ''f''(''x''; ''α'', ''β'') is the [[probability density function]] of the beta distribution:

:<math>f(x;\alpha,\beta) = \frac{1}{\Beta(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1}</math>

The [[digamma function]] ''ψ'' appears in the formula for the differential entropy as a consequence of Euler's integral formula for the [[harmonic number]]s, which follows from the integral:

:<math>\int_0^1 \frac {1-x^{\alpha-1}}{1-x} \, dx = \psi(\alpha)-\psi(1)</math>

The [[information entropy|differential entropy]] of the beta distribution is negative for all values of ''α'' and ''β'' greater than zero, except at ''α'' = ''β'' = 1 (for which the beta distribution is the same as the [[Uniform distribution (continuous)|uniform distribution]]), where the differential entropy reaches its [[Maxima and minima|maximum]] value of zero. It is to be expected that the maximum entropy should occur when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable.

For (either or both) ''α'' or ''β'' approaching zero, the differential entropy approaches its [[Maxima and minima|minimum]] value of negative infinity, and there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly, for (either or both) ''α'' or ''β'' approaching infinity, the differential entropy approaches its minimum value of negative infinity, and there is a maximum amount of order. If either ''α'' or ''β'' approaches infinity (and the other is finite), all the probability density is concentrated at one end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), ''α'' = ''β'', and they approach infinity simultaneously, the probability density becomes a spike ([[Dirac delta function]]) concentrated at the middle ''x'' = 1/2, and hence there is 100% probability at the middle ''x'' = 1/2 and zero probability everywhere else.

[[File:Differential Entropy Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg|325px]][[File:Differential Entropy Beta Distribution for alpha and beta from 0.1 to 5 - J. Rodal.jpg|325px]]
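The closed-form expression above is straightforward to evaluate numerically. The following minimal sketch (illustrative only; it assumes Python with SciPy, which is not prescribed by the cited sources, and the function name is ad hoc) computes ''h''(''X'') in nats from the formula and cross-checks it against SciPy's built-in differential entropy for the beta distribution:

<syntaxhighlight lang="python">
# Minimal sketch: differential entropy h(X) of Beta(alpha, beta) in nats,
# evaluated from the closed form above and cross-checked against SciPy's
# built-in differential entropy. SciPy is an assumed dependency.
from scipy.special import betaln, digamma
from scipy.stats import beta as beta_dist

def beta_differential_entropy(a, b):
    """h(X) = ln B(a,b) - (a-1)psi(a) - (b-1)psi(b) + (a+b-2)psi(a+b)."""
    return (betaln(a, b)
            - (a - 1) * digamma(a)
            - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))

for a, b in [(1, 1), (3, 3), (3, 0.5), (0.5, 3)]:
    closed_form = beta_differential_entropy(a, b)
    numeric = beta_dist(a, b).entropy()  # SciPy's differential entropy, in nats
    print(f"Beta({a}, {b}): h = {closed_form:.6f} (SciPy: {numeric:.6f})")
</syntaxhighlight>

For the parameter pairs used in the numerical examples further below, this reproduces ''h'' = 0, −0.267864, −1.10805 and −1.10805, respectively.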
The (continuous case) [[information entropy|differential entropy]] was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part of the same paper where he defined the [[information entropy|discrete entropy]].<ref>{{cite journal |last=Shannon |first=Claude E. |title=A Mathematical Theory of Communication |journal=Bell System Technical Journal |volume=27 |issue=4 |pages=623–656 |year=1948 |doi=10.1002/j.1538-7305.1948.tb01338.x }}</ref> It has been known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, so the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy.

Given two beta distributed random variables, ''X''<sub>1</sub> ~ Beta(''α'', ''β'') and ''X''<sub>2</sub> ~ Beta(''{{prime|α}}'', ''{{prime|β}}''), the [[cross-entropy]] is (measured in nats)<ref name="Cover and Thomas">{{cite book |last1=Cover |first1=Thomas M. |last2=Thomas |first2=Joy A. |title=Elements of Information Theory |edition=2nd |series=Wiley Series in Telecommunications and Signal Processing |year=2006 |publisher=Wiley-Interscience |isbn=978-0471241959}}</ref>

:<math>\begin{align}
H(X_1,X_2) &= \int_0^1 - f(x;\alpha,\beta) \ln (f(x;\alpha',\beta')) \, dx \\[4pt]
&= \ln \left(\Beta(\alpha',\beta')\right)-(\alpha'-1)\psi(\alpha)-(\beta'-1)\psi(\beta)+(\alpha'+\beta'-2)\psi(\alpha+\beta).
\end{align}</math>

The [[cross entropy]] has been used as an error metric to measure the distance between two hypotheses.<ref name=Plunkett>{{cite book |last1=Plunkett |first1=Kim |last2=Elman |first2=Jeffrey |title=Exercises in Rethinking Innateness: A Handbook for Connectionist Simulations (Neural Network Modeling and Connectionism) |year=1997 |publisher=A Bradford Book |page=166 |isbn=978-0262661058 |url-access=registration |url=https://archive.org/details/exercisesinrethi0000plun}}</ref><ref name=Nallapati>{{cite thesis |last=Nallapati |first=Ramesh |title=The smoothed dirichlet distribution: understanding cross-entropy ranking in information retrieval |year=2006 |publisher=Computer Science Dept., University of Massachusetts Amherst |url=http://maroo.cs.umass.edu/pub/web/getpdf.php?id=679}}</ref> For fixed ''X''<sub>1</sub>, it is minimized over ''X''<sub>2</sub> when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood<ref name="Cover and Thomas" /> (see the section on "Parameter estimation, Maximum likelihood estimation").

The relative entropy, or [[Kullback–Leibler divergence]] ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>), is a measure of the inefficiency of assuming that the distribution is ''X''<sub>2</sub> ~ Beta(''{{prime|α}}'', ''{{prime|β}}'') when the distribution is really ''X''<sub>1</sub> ~ Beta(''α'', ''β''). It is defined as follows (measured in nats):

:<math>\begin{align}
D_{\mathrm{KL}}(X_1\parallel X_2) &= \int_0^1 f(x;\alpha,\beta) \ln \left(\frac{f(x;\alpha,\beta)}{f(x;\alpha',\beta')} \right) \, dx \\[4pt]
&= \left(\int_0^1 f(x;\alpha,\beta) \ln (f(x;\alpha,\beta)) \, dx \right) - \left(\int_0^1 f(x;\alpha,\beta) \ln (f(x;\alpha',\beta')) \, dx \right) \\[4pt]
&= -h(X_1) + H(X_1,X_2) \\[4pt]
&= \ln\left(\frac{\Beta(\alpha',\beta')}{\Beta(\alpha,\beta)}\right)+(\alpha-\alpha')\psi(\alpha)+(\beta-\beta')\psi(\beta)+(\alpha'-\alpha+\beta'-\beta)\psi(\alpha + \beta).
\end{align}</math>

The relative entropy, or Kullback–Leibler divergence, is always non-negative.
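These closed forms are likewise easy to evaluate directly. The sketch below (again only an illustration, assuming SciPy; the function names are ad hoc) implements the cross-entropy and the Kullback–Leibler divergence and checks the identity ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) = −''h''(''X''<sub>1</sub>) + ''H''(''X''<sub>1</sub>, ''X''<sub>2</sub>) from the derivation above:

<syntaxhighlight lang="python">
# Sketch (assuming SciPy): closed-form cross-entropy H(X1, X2) and
# Kullback-Leibler divergence D_KL(X1 || X2) for X1 ~ Beta(a, b), X2 ~ Beta(ap, bp).
from scipy.special import betaln, digamma

def beta_cross_entropy(a, b, ap, bp):
    """H(X1, X2) in nats."""
    return (betaln(ap, bp)
            - (ap - 1) * digamma(a)
            - (bp - 1) * digamma(b)
            + (ap + bp - 2) * digamma(a + b))

def beta_differential_entropy(a, b):
    """h(X) in nats: the cross-entropy of a distribution with itself."""
    return beta_cross_entropy(a, b, a, b)

def beta_kl_divergence(a, b, ap, bp):
    """D_KL(X1 || X2) in nats, via the last line of the derivation above."""
    return (betaln(ap, bp) - betaln(a, b)
            + (a - ap) * digamma(a)
            + (b - bp) * digamma(b)
            + (ap - a + bp - b) * digamma(a + b))

# Check D_KL = -h(X1) + H(X1, X2) on one pair of parameter choices.
a, b, ap, bp = 1.0, 1.0, 3.0, 3.0
assert abs(beta_kl_divergence(a, b, ap, bp)
           - (-beta_differential_entropy(a, b) + beta_cross_entropy(a, b, ap, bp))) < 1e-12
</syntaxhighlight>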
A few numerical examples follow:
*''X''<sub>1</sub> ~ Beta(1, 1) and ''X''<sub>2</sub> ~ Beta(3, 3); ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) = 0.598803; ''D''<sub>KL</sub>(''X''<sub>2</sub> || ''X''<sub>1</sub>) = 0.267864; ''h''(''X''<sub>1</sub>) = 0; ''h''(''X''<sub>2</sub>) = −0.267864
*''X''<sub>1</sub> ~ Beta(3, 0.5) and ''X''<sub>2</sub> ~ Beta(0.5, 3); ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) = 7.21574; ''D''<sub>KL</sub>(''X''<sub>2</sub> || ''X''<sub>1</sub>) = 7.21574; ''h''(''X''<sub>1</sub>) = −1.10805; ''h''(''X''<sub>2</sub>) = −1.10805.

The [[Kullback–Leibler divergence]] is not symmetric, ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) ≠ ''D''<sub>KL</sub>(''X''<sub>2</sub> || ''X''<sub>1</sub>), for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric but have different entropies, ''h''(''X''<sub>1</sub>) ≠ ''h''(''X''<sub>2</sub>). The value of the Kullback divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback divergence measures the inefficiency of assuming that the distribution is the (bell-shaped) Beta(3, 3) when it is really the (uniform) Beta(1, 1). The differential entropy ''h'' of Beta(1, 1) is higher than that of Beta(3, 3) because the uniform distribution Beta(1, 1) has a maximum amount of disorder. The Kullback divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is the (bell-shaped) Beta(3, 3) rather than the other way around. In this restricted sense, the Kullback divergence is consistent with the [[second law of thermodynamics]].

The [[Kullback–Leibler divergence]] is symmetric, ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) = ''D''<sub>KL</sub>(''X''<sub>2</sub> || ''X''<sub>1</sub>), for the skewed cases Beta(3, 0.5) and Beta(0.5, 3), which have equal differential entropy ''h''(''X''<sub>1</sub>) = ''h''(''X''<sub>2</sub>). The symmetry condition

:<math>D_{\mathrm{KL}}(X_1\parallel X_2) = D_{\mathrm{KL}}(X_2\parallel X_1),\text{ if }h(X_1) = h(X_2),\text{ for (skewed) }\alpha \neq \beta</math>

follows from the above definitions and the mirror-symmetry ''f''(''x''; ''α'', ''β'') = ''f''(1 − ''x''; ''β'', ''α'') enjoyed by the beta distribution.
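The tabulated values and the symmetry property can also be checked independently of the closed forms, for example with a simple Monte Carlo estimate of the defining integral. The following sketch (illustrative only; it assumes NumPy and SciPy, and the function name is ad hoc) estimates ''D''<sub>KL</sub>(''X''<sub>1</sub> || ''X''<sub>2</sub>) as the sample mean of ln(''f''<sub>1</sub>(''X'')/''f''<sub>2</sub>(''X'')) with ''X'' drawn from ''X''<sub>1</sub>:

<syntaxhighlight lang="python">
# Monte Carlo cross-check (assuming NumPy and SciPy) of the numerical examples
# above: D_KL(X1 || X2) = E[ln f1(X) - ln f2(X)], with X ~ X1 ~ Beta(a, b).
import numpy as np
from scipy.stats import beta

def kl_monte_carlo(a, b, ap, bp, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    x = beta(a, b).rvs(size=n, random_state=rng)
    return np.mean(beta(a, b).logpdf(x) - beta(ap, bp).logpdf(x))

print(kl_monte_carlo(1, 1, 3, 3))      # ~ 0.598803, up to Monte Carlo error
print(kl_monte_carlo(3, 3, 1, 1))      # ~ 0.267864
print(kl_monte_carlo(3, 0.5, 0.5, 3))  # ~ 7.21574
print(kl_monte_carlo(0.5, 3, 3, 0.5))  # ~ 7.21574 (mirror-symmetric pair)
</syntaxhighlight>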