Editing Entropy (information theory) (section)

==Characterization==
To understand the meaning of {{math|−Σ ''p''<sub>''i''</sub> log(''p''<sub>''i''</sub>)}}, first define an information function {{math|I}} in terms of an event {{math|''i''}} with probability {{math|''p''<sub>''i''</sub>}}. The amount of information acquired due to the observation of event {{math|''i''}} follows from Shannon's solution of the fundamental properties of [[Information content|information]]:<ref>{{cite book |last=Carter |first=Tom |date=March 2014 |title=An introduction to information theory and entropy |url=http://csustan.csustan.edu/~tom/Lecture-Notes/Information-Theory/info-lec.pdf |location=Santa Fe |access-date=4 August 2017 |archive-date=4 June 2016 |archive-url=https://web.archive.org/web/20160604130248/http://csustan.csustan.edu/~tom/Lecture-Notes/Information-Theory/info-lec.pdf |url-status=live }}</ref>
# {{math|I(''p'')}} is [[monotonically decreasing]] in {{math|''p''}}: an increase in the probability of an event decreases the information from an observed event, and vice versa.
# {{math|I(1) {{=}} 0}}: events that always occur do not communicate information.
# {{math|I(''p''<sub>1</sub>·''p''<sub>2</sub>) {{=}} I(''p''<sub>1</sub>) + I(''p''<sub>2</sub>)}}: the information learned from [[independent events]] is the sum of the information learned from each event.
# {{math|I(''p'')}} is a twice continuously differentiable function of p.

Given two independent events, if the first event can yield one of {{math|''n''}} [[equiprobable]] outcomes and another has one of {{math|''m''}} equiprobable outcomes then there are {{math|''mn''}} equiprobable outcomes of the joint event. This means that if {{math|log<sub>2</sub>(''n'')}} bits are needed to encode the first value and {{math|log<sub>2</sub>(''m'')}} to encode the second, one needs {{math|log<sub>2</sub>(''mn'') {{=}} log<sub>2</sub>(''m'') + log<sub>2</sub>(''n'')}} to encode both.

Shannon discovered that a suitable choice of <math>\operatorname{I}</math> is given by:<ref>Chakrabarti, C. G., and Indranil Chakrabarty. "Shannon entropy: axiomatic characterization and application." ''International Journal of Mathematics and Mathematical Sciences'' 2005. 17 (2005): 2847-2854 [https://arxiv.org/pdf/quant-ph/0511171.pdf url] {{Webarchive|url=https://web.archive.org/web/20211005135940/https://arxiv.org/pdf/quant-ph/0511171.pdf |date=5 October 2021 }}</ref>
<math display="block">\operatorname{I}(p) = \log\left(\tfrac{1}{p}\right) = -\log(p).</math>

In fact, the only possible values of <math>\operatorname{I}</math> are <math>\operatorname{I}(u) = k \log u</math> for <math>k<0</math>. Additionally, choosing a value for {{math|''k''}} is equivalent to choosing a value <math>x>1</math> for <math>k = - 1/\log x</math>, so that {{math|''x''}} corresponds to the [[Base of a logarithm|base for the logarithm]]. Thus, entropy is [[characterization (mathematics)|characterized]] by the above four properties.

{| class="toccolours collapsible collapsed" width="80%" style="text-align:left; margin-left:1.5em;"
!Proof
|-
|Let <math display="inline">\operatorname{I}</math> be the information function which one assumes to be twice continuously differentiable, one has:

<math display="block">\begin{align}
& \operatorname{I}(p_1 p_2) &=\ & \operatorname{I}(p_1) +  \operatorname{I}(p_2) && \quad \text{Starting from property 3} \\
& p_2 \operatorname{I}'(p_1 p_2) &=\ & \operatorname{I}'(p_1) && \quad \text{taking the derivative w.r.t}\ p_1 \\
& \operatorname{I}'(p_1 p_2) + p_1 p_2 \operatorname{I}''(p_1 p_2) &=\ & 0 && \quad \text{taking the derivative w.r.t}\ p_2 \\
& \operatorname{I}'(u) + u \operatorname{I}''(u) &=\ & 0 && \quad \text{introducing}\, u = p_1 p_2 \\
& (u\operatorname{I}'(u))' &=\ & 0 && \quad \text{combining terms into one}\ \\
& u\operatorname{I}'(u) - k  &=\ & 0 && \quad \text{integrating w.r.t}\ u, \text{producing constant}\, k \\
\end{align}</math>

This [[differential equation]] leads to the solution <math>\operatorname{I}(u) = k \log u + c</math> for some <math>k, c \in \mathbb{R}</math>. Property 2 gives <math>c = 0</math>. Property 1 and 2 give that <math>\operatorname{I}(p)\ge 0</math> for all <math>p\in [0,1]</math>, so that <math>k < 0</math>.
|}

The different [[units of information]] ([[bit]]s for the [[binary logarithm]] {{math|log<sub>2</sub>}}, [[Nat (unit)|nats]] for the [[natural logarithm]] {{math|ln}}, [[Ban (unit)|bans]] for the [[decimal logarithm]] {{math|log<sub>10</sub>}} and so on) are [[Proportionality (mathematics)|constant multiples]] of each other. For instance, in case of a fair coin toss, heads provides {{math|log<sub>2</sub>(2) {{=}} 1}} bit of information, which is approximately 0.693&nbsp;nats or 0.301&nbsp;decimal digits. Because of additivity, {{math|''n''}} tosses provide {{math|''n''}} bits of information, which is approximately {{math|0.693''n''}} nats or {{math|0.301''n''}} decimal digits.

The ''meaning'' of the events observed (the meaning of ''messages'') does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying [[probability distribution]], not the meaning of the events themselves.

===Alternative characterization===
Another characterization of entropy uses the following properties. We denote {{math|''p''<sub>''i''</sub> {{=}} Pr(''X'' {{=}} ''x''<sub>''i''</sub>)}} and {{math|Η<sub>''n''</sub>(''p''<sub>1</sub>, ..., ''p''<sub>''n''</sub>) {{=}} Η(''X'')}}.

# Continuity: {{math|H}} should be [[continuous function|continuous]], so that changing the values of the probabilities by a very small amount should only change the entropy by a small amount.
# Symmetry: {{math|H}} should be unchanged if the outcomes {{math|''x''<sub>''i''</sub>}} are re-ordered. That is, <math>\Eta_n\left(p_1, p_2, \ldots, p_n \right) = \Eta_n\left(p_{i_1}, p_{i_2}, \ldots, p_{i_n} \right)</math> for any [[permutation]] <math>\{i_1, ..., i_n\}</math> of <math>\{1, ..., n\}</math>.
# Maximum: <math>\Eta_n</math> should be maximal if all the outcomes are equally likely i.e. <math>\Eta_n(p_1,\ldots,p_n) \le \Eta_n\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)</math>.
# Increasing number of outcomes: for equiprobable events, the entropy should increase with the number of outcomes i.e. <math>\Eta_n\bigg(\underbrace{\frac{1}{n}, \ldots, \frac{1}{n}}_{n}\bigg) < \Eta_{n+1}\bigg(\underbrace{\frac{1}{n+1}, \ldots, \frac{1}{n+1}}_{n+1}\bigg).</math>
# Additivity: given an ensemble of {{math|''n''}} uniformly distributed elements that are partitioned into {{math|''k''}} boxes (sub-systems) with {{math|''b''<sub>1</sub>, ..., ''b''<sub>''k''</sub>}} elements each, the entropy of the whole ensemble should be equal to the sum of the entropy of the system of boxes and the individual entropies of the boxes, each weighted with the probability of being in that particular box.

==== Discussion ====
The rule of additivity has the following consequences: for [[positive integers]] {{math|''b''<sub>''i''</sub>}} where {{math|''b''<sub>1</sub> + ... + ''b''<sub>''k''</sub> {{=}} ''n''}},
<math display="block">\Eta_n\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \Eta_k\left(\frac{b_1}{n}, \ldots, \frac{b_k}{n}\right) + \sum_{i=1}^k \frac{b_i}{n} \, \Eta_{b_i}\left(\frac{1}{b_i}, \ldots, \frac{1}{b_i}\right).</math>

Choosing {{math|''k'' {{=}} ''n''}}, {{math|''b''<sub>1</sub> {{=}} ... {{=}} ''b''<sub>''n''</sub> {{=}} 1}} this implies that the entropy of a certain outcome is zero: {{math|Η<sub>1</sub>(1) {{=}} 0}}. This implies that the efficiency of a source set with {{math|''n''}} symbols can be defined simply as being equal to its {{math|''n''}}-ary entropy. See also [[Redundancy (information theory)]].

The characterization here imposes an additive property with respect to a [[partition of a set]]. Meanwhile, the [[conditional probability]] is defined in terms of a multiplicative property, <math>P(A\mid B)\cdot P(B)=P(A\cap B)</math>. Observe that a logarithm mediates between these two operations. The [[conditional entropy]] and related quantities inherit simple relation, in turn. The measure theoretic definition in the previous section defined the entropy as a sum over expected surprisals <math>\mu(A)\cdot \ln\mu(A)</math> for an extremal partition. Here the logarithm is ad hoc and the entropy is not a measure in itself. At least in the information theory of a binary string, <math>\log_2</math> lends itself to practical interpretations.

Motivated by such relations, a plethora of related and competing quantities have been defined. For example, [[David Ellerman]]'s analysis of a "logic of partitions" defines a competing measure in structures [[Duality (mathematics)|dual]] to that of subsets of a universal set.<ref>{{cite journal |last1=Ellerman |first1=David |title=Logical Information Theory: New Logical Foundations for Information Theory |journal=Logic Journal of the IGPL |date=October 2017 |volume=25 |issue=5 |pages=806–835 |doi=10.1093/jigpal/jzx022 |url=http://philsci-archive.pitt.edu/13213/1/Logic-to-information-theory3.pdf |access-date=2 November 2022 |archive-date=25 December 2022 |archive-url=https://web.archive.org/web/20221225080028/https://philsci-archive.pitt.edu/13213/1/Logic-to-information-theory3.pdf |url-status=live }}</ref> Information is quantified as "dits" (distinctions), a measure on partitions.  "Dits" can be converted into [[Shannon (unit)|Shannon's bits]], to get the formulas for conditional entropy, and so on.

===Alternative characterization via additivity and subadditivity===

Another succinct axiomatic characterization of Shannon entropy was given by [[János Aczél (mathematician)|Aczél]], Forte and Ng,<ref name="aczelentropy">{{cite journal|last1=Aczél|first1=J.|title=Why the Shannon and Hartley entropies are 'natural'|last2=Forte|first2=B.|last3=Ng|first3=C. T.|journal=Advances in Applied Probability|date=1974|volume=6|issue=1|pages=131–146|doi=10.2307/1426210 |jstor=1426210 |s2cid=204177762 }}</ref> via the following properties:

# Subadditivity:  <math>\Eta(X,Y) \le \Eta(X)+\Eta(Y)</math>   for jointly distributed random variables <math>X,Y</math>.
# Additivity:  <math>\Eta(X,Y) = \Eta(X)+\Eta(Y)</math> when the random variables <math>X,Y</math> are independent.
# Expansibility:  <math>\Eta_{n+1}(p_1, \ldots, p_n, 0) = \Eta_n(p_1, \ldots, p_n)</math>, i.e., adding an outcome with probability zero does not change the entropy.
# Symmetry: <math>\Eta_n(p_1, \ldots, p_n)</math> is invariant under permutation of <math>p_1, \ldots, p_n</math>.
# Small for small probabilities:  <math>\lim_{q \to 0^+} \Eta_2(1-q, q) = 0</math>.

==== Discussion ====
It was shown that any function <math>\Eta</math> satisfying the above properties must be a constant multiple of Shannon entropy, with a non-negative constant.<ref name="aczelentropy"/> Compared to the previously mentioned characterizations of entropy, this characterization focuses on the properties of entropy as a function of random variables (subadditivity and additivity), rather than the properties of entropy as a function of the probability vector <math>p_1,\ldots ,p_n</math>.

It is worth noting that if we drop the "small for small probabilities" property, then <math>\Eta</math> must be a non-negative linear combination of the Shannon entropy and the [[Hartley entropy]].<ref name="aczelentropy"/>