
==Entropy for continuous random variables==

===Differential entropy===
{{Main|Differential entropy}}

The Shannon entropy is restricted to random variables taking discrete values. The corresponding formula for a continuous random variable with [[probability density function]] {{math|''f''(''x'')}} with finite or infinite support <math>\mathbb X</math> on the real line is defined by analogy, using the above form of the entropy as an expectation:<ref name=cover1991/>{{rp|p=224}}

<math display="block">\Eta(X) = \mathbb{E}[-\log f(X)] = -\int_\mathbb X f(x) \log f(x)\, \mathrm{d}x.</math>

This is the differential entropy (or continuous entropy). A precursor of the continuous entropy {{math|''h''[''f'']}} is the expression for the functional {{math|''Η''}} in the [[H-theorem]] of Boltzmann.
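
The defining integral can be evaluated numerically. The following is a minimal sketch, assuming a simple midpoint rule and natural logarithms; the helper names <code>differential_entropy</code> and <code>uniform_pdf</code> are illustrative and not taken from any library. It evaluates the integral for uniform densities on intervals of width 2 and 1/2, where the exact values are log 2 and log(1/2) respectively, the latter being negative.

```python
import math

def differential_entropy(pdf, lo, hi, n=100_000):
    """Midpoint-rule estimate of -∫ f(x) log f(x) dx over [lo, hi], in nats."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0.0:  # convention: 0 * log 0 is treated as 0
            total -= fx * math.log(fx) * dx
    return total

def uniform_pdf(width):
    """Density of the uniform distribution on [0, width]."""
    return lambda x: 1.0 / width if 0.0 <= x <= width else 0.0

h_wide = differential_entropy(uniform_pdf(2.0), 0.0, 2.0)    # exact value: log 2
h_narrow = differential_entropy(uniform_pdf(0.5), 0.0, 0.5)  # exact value: log(1/2)

print(h_wide, h_narrow)
```

A density that exceeds 1 over its whole support, as in the second case, always yields a negative value.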

Although the analogy between the two functions is suggestive, the following question must be asked: is the differential entropy a valid extension of the Shannon discrete entropy? Differential entropy lacks a number of properties that the Shannon discrete entropy has&nbsp;– it can even be negative&nbsp;– and corrections have been suggested, notably [[limiting density of discrete points]].

To answer this question, a connection must be established between the two functions, so as to obtain a generally finite measure as the [[bin size]] goes to zero. In the discrete case, the bin size is the (implicit) width of each of the {{math|''n''}} (finite or infinite) bins whose probabilities are denoted by {{math|''p''<sub>''n''</sub>}}. In generalizing to the continuous domain, the width must be made explicit.

To do this, start with a continuous function {{math|''f''}} discretized into bins of size <math>\Delta</math>.
By the mean-value theorem there exists a value {{math|''x''<sub>''i''</sub>}} in each bin such that
<math display="block">f(x_i) \Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx ,</math>
so that the integral of the function {{math|''f''}} can be approximated (in the Riemann sense) by
<math display="block">\int_{-\infty}^{\infty} f(x)\, dx = \lim_{\Delta \to 0} \sum_{i = -\infty}^{\infty} f(x_i) \Delta ,</math>
where this limit and "bin size goes to zero" are equivalent.

We will denote
<math display="block">\Eta^{\Delta} := - \sum_{i=-\infty}^{\infty} f(x_i)  \Delta \log \left(  f(x_i)  \Delta \right)</math>
and expanding the logarithm, we have
<math display="block">\Eta^{\Delta} = - \sum_{i=-\infty}^{\infty}  f(x_i)  \Delta \log (f(x_i)) -\sum_{i=-\infty}^{\infty} f(x_i) \Delta \log (\Delta).</math>

As {{math|Δ → 0}}, we have

<math display="block">\begin{align}
\sum_{i=-\infty}^{\infty} f(x_i) \Delta &\to \int_{-\infty}^{\infty} f(x)\, dx = 1 \\
\sum_{i=-\infty}^{\infty} f(x_i) \Delta \log (f(x_i)) &\to \int_{-\infty}^{\infty} f(x) \log f(x)\, dx.
\end{align}</math>

Note that {{math|log(Δ) → −∞}} as {{math|Δ → 0}}, so <math>\Eta^{\Delta}</math> itself diverges. A special definition of the differential or continuous entropy is therefore required:

<math display="block">h[f] = \lim_{\Delta \to 0} \left(\Eta^{\Delta} + \log \Delta\right) = -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx,</math>

which is, as noted above, referred to as the differential entropy. This means that the differential entropy ''is not'' a limit of the Shannon entropy for {{math|''n'' → ∞}}. Rather, it differs from the limit of the Shannon entropy by an infinite offset (see also the article on [[information dimension]]).
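
The infinite offset can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a standard normal density, truncates the real line to [−10, 10], and takes the exact probability of each bin (computed from the error function), which by the mean-value theorem equals {{math|''f''(''x''<sub>''i''</sub>)Δ}}. The function names are hypothetical. As the bin width shrinks, the discrete entropy grows without bound, while adding {{math|log Δ}} brings it close to the closed-form differential entropy of the normal distribution, ½&nbsp;log(2π''e'').

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binned_entropy(delta, span=10.0):
    """Shannon entropy (nats) of a standard normal quantized into bins of width delta.

    The real line is truncated to [-span, span]; the mass outside is negligible.
    """
    n_bins = int(round(2.0 * span / delta))
    entropy = 0.0
    for i in range(n_bins):
        left = -span + i * delta
        p = normal_cdf(left + delta) - normal_cdf(left)  # exact bin probability
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy

# Closed-form differential entropy of the standard normal.
h_exact = 0.5 * math.log(2.0 * math.pi * math.e)

for delta in (1.0, 0.1, 0.01):
    H = binned_entropy(delta)
    print(f"delta={delta}: H={H:.4f}, H + log(delta)={H + math.log(delta):.4f}, h={h_exact:.4f}")
```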

===Limiting density of discrete points===
{{Main|Limiting density of discrete points}}

As a result, unlike the Shannon entropy, the differential entropy is ''not'' in general a good measure of uncertainty or information. For example, the differential entropy can be negative; it is also not invariant under continuous co-ordinate transformations. This problem may be illustrated by a change of units when {{math|''x''}} is a dimensioned variable: {{math|''f''(''x'')}} then has the units of {{math|1/''x''}}. The argument of the logarithm must be dimensionless, otherwise it is improper, so the differential entropy as given above is improper. If {{math|''Δ''}} is some "standard" value of {{math|''x''}} (i.e. "bin size") and therefore has the same units, then a modified differential entropy may be written in proper form as:
<math display="block">\Eta=-\int_{-\infty}^\infty f(x) \log(f(x)\,\Delta)\,dx ,</math>
and the result will be the same for any choice of units for {{math|''x''}}. In fact, the limit of discrete entropy as <math> N \rightarrow \infty </math> would also include a term of <math> \log(N)</math>, which would in general be infinite. This is expected: continuous variables would typically have infinite entropy when discretized. The [[limiting density of discrete points]] is really a measure of how much easier a distribution is to describe than a distribution that is uniform over its quantization scheme.
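
The effect of a change of units can be illustrated with a short calculation. The sketch below assumes a normal distribution, for which the differential entropy has the closed form ½&nbsp;log(2π''e''σ<sup>2</sup>), and describes the same physical distribution and the same physical bin size (1&nbsp;cm) once in metres and once in centimetres; these numerical values are chosen purely for illustration. The plain differential entropy shifts by log&nbsp;100 between the two descriptions, whereas the combination {{math|''h''[''f''] − log Δ}}, which is the modified entropy above up to the overall sign convention, is the same in both.

```python
import math

def gaussian_differential_entropy(sigma):
    """Closed form h = 0.5 * log(2*pi*e*sigma^2), in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# The same physical distribution and the same physical bin (1 cm),
# expressed in two unit systems (illustrative values).
sigma_m, delta_m = 1.0, 0.01      # metres
sigma_cm, delta_cm = 100.0, 1.0   # centimetres

h_m = gaussian_differential_entropy(sigma_m)
h_cm = gaussian_differential_entropy(sigma_cm)

# Dimensionless combination h - log(delta): the units of f and delta cancel.
modified_m = h_m - math.log(delta_m)
modified_cm = h_cm - math.log(delta_cm)

print(h_cm - h_m)               # shift of log(100) in the plain differential entropy
print(modified_m, modified_cm)  # agree up to rounding
```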

===Relative entropy===
{{main|Generalized relative entropy}}
Another useful measure of entropy that works equally well in the discrete and the continuous case is the '''relative entropy''' of a distribution. It is defined as the [[Kullback–Leibler divergence]] from the distribution to a reference measure {{math|''m''}} as follows. Assume that a probability distribution {{math|''p''}} is [[absolutely continuous]] with respect to a measure {{math|''m''}}, i.e. is of the form {{math|''p''(''dx'') {{=}} ''f''(''x'')''m''(''dx'')}} for some non-negative {{math|''m''}}-integrable function {{math|''f''}} with {{math|''m''}}-integral 1. Then the relative entropy can be defined as
<math display="block">D_{\mathrm{KL}}(p \| m ) = \int \log (f(x)) p(dx) = \int f(x)\log (f(x)) m(dx) .</math>

In this form the relative entropy generalizes (up to change in sign) both the discrete entropy, where the measure {{math|''m''}} is the [[counting measure]], and the differential entropy, where the measure {{math|''m''}} is the [[Lebesgue measure]]. If the measure {{math|''m''}} is itself a probability distribution, the relative entropy is non-negative, and zero if and only if {{math|''p'' {{=}} ''m''}} as measures. It is defined for any measure space, and is hence coordinate-independent and invariant under co-ordinate reparameterizations, provided the transformation of the measure {{math|''m''}} is properly taken into account. The relative entropy, and (implicitly) entropy and differential entropy, do depend on the "reference" measure {{math|''m''}}.
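
The dependence on the reference measure can be seen in a small discrete example. The sketch below assumes a four-point distribution chosen for illustration and uses hypothetical helper names. With the counting measure as reference, the relative entropy equals the negative of the Shannon entropy; with the uniform probability distribution as reference, it equals {{math|log 4}} minus the Shannon entropy and is non-negative; and it vanishes when the reference is the distribution itself.

```python
import math

def relative_entropy(p, m):
    """Sum of p(x) * log(p(x) / m(x)), i.e. the integral of f log f dm with f = p/m."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

p = [0.5, 0.25, 0.125, 0.125]      # illustrative distribution
counting = [1.0] * len(p)          # counting measure: mass 1 on each point
uniform = [1.0 / len(p)] * len(p)  # uniform probability distribution

d_counting = relative_entropy(p, counting)  # equals -H(p)
d_uniform = relative_entropy(p, uniform)    # equals log(4) - H(p), non-negative
d_self = relative_entropy(p, p)             # zero

print(d_counting, -shannon_entropy(p))
print(d_uniform, math.log(len(p)) - shannon_entropy(p))
print(d_self)
```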