===The Wallis derivation===

The following argument is the result of a suggestion made by [[Graham Wallis]] to E. T. Jaynes in 1962.<ref name=Jaynes2003/> It is essentially the same mathematical argument used for the [[Maxwell–Boltzmann statistics]] in [[statistical mechanics]], although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed ''a priori''; rather, it is found in the course of the argument, and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make a probability assignment among <math>m</math> [[mutually exclusive]] propositions. They have some testable information, but are not sure how to go about including this information in their probability assessment. They therefore conceive of the following random experiment. They will distribute <math>N</math> quanta of probability (each worth <math>1/N</math>) at random among the <math>m</math> possibilities. (One might imagine that they will throw <math>N</math> balls into <math>m</math> buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, they will check whether the probability assignment thus obtained is consistent with their information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures.) If it is inconsistent, they will reject it and try again.
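The rejection-sampling experiment described above can be sketched in a few lines of Python. This is only an illustration, not part of the original argument; the constraint used here (a die whose mean is at least 3.5, expressed by the hypothetical helper `mean_at_least`) is an arbitrary example of testable information.

```python
import random
from collections import Counter


def wallis_experiment(m, N, consistent, seed=None):
    """Throw N probability quanta at random into m buckets; repeat
    until the resulting assignment passes the consistency test."""
    rng = random.Random(seed)
    while True:
        counts = Counter(rng.randrange(m) for _ in range(N))
        p = [counts[i] / N for i in range(m)]
        if consistent(p):
            return p


# Hypothetical constraint: a six-sided die whose mean is at least 3.5.
def mean_at_least(threshold):
    return lambda p: sum((i + 1) * pi for i, pi in enumerate(p)) >= threshold


p = wallis_experiment(m=6, N=1000, consistent=mean_at_least(3.5), seed=0)
```

Each accepted `p` is one "fair" random assignment consistent with the information; the argument that follows asks which such assignment is the most probable.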
If it is consistent, their assessment will be

:<math>p_i = \frac{n_i}{N},</math>

where <math>p_i</math> is the probability of the <math>i</math>th proposition, while <math>n_i</math> is the number of quanta that were assigned to the <math>i</math>th proposition (i.e. the number of balls that ended up in bucket <math>i</math>).

Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carrying out, and possibly having to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is given by the [[multinomial distribution]],

:<math>\Pr(\mathbf{p}) = W \cdot m^{-N},</math>

where

:<math>W = \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!}</math>

is sometimes known as the multiplicity of the outcome.

The most probable result is the one which maximizes the multiplicity <math>W</math>. Rather than maximizing <math>W</math> directly, the protagonist could equivalently maximize any monotonically increasing function of <math>W</math>. They decide to maximize

:<math>\begin{align}
\frac 1 N \log W &= \frac 1 N \log \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!} \\[6pt]
&= \frac 1 N \log \frac{N!}{(Np_1)! \, (Np_2)! \, \dotsb \, (Np_m)!} \\[6pt]
&= \frac 1 N \left( \log N! - \sum_{i=1}^m \log ((Np_i)!) \right).
\end{align}</math>

At this point, in order to simplify the expression, the protagonist takes the limit as <math>N\to\infty</math>, i.e. as the probability levels go from grainy discrete values to smooth continuous values.
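As a numerical aside (not part of the original derivation), the multiplicity <math>W</math> can be compared directly for two candidate assignments. A minimal sketch, using `math.lgamma` to compute log-factorials without overflow:

```python
import math


def log_multiplicity(counts):
    """log W = log( N! / (n_1! n_2! ... n_m!) ), computed via the
    log-gamma function since the factorials themselves overflow."""
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)


# With N = 60 balls in m = 3 buckets, the even split has the larger
# multiplicity, so it is the more probable outcome of the experiment:
even = log_multiplicity([20, 20, 20])
skew = log_multiplicity([30, 20, 10])
assert even > skew
```

Because <math>\log</math> is monotonically increasing, comparing <math>\log W</math> ranks outcomes exactly as comparing <math>W</math> would.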
Using [[Stirling's approximation]], they find :<math> \begin{align} \lim_{N \to \infty}\left(\frac{1}{N}\log W\right) &= \frac 1 N \left( N\log N - \sum_{i=1}^m Np_i\log (Np_i) \right) \\[6pt] &= \log N - \sum_{i=1}^m p_i\log (Np_i) \\[6pt] &= \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \\[6pt] &= \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \\[6pt] &= - \sum_{i=1}^m p_i\log p_i \\[6pt] &= H(\mathbf{p}). \end{align} </math> All that remains for the protagonist to do is to maximize entropy under the constraints of their testable information. They have found that the maximum entropy distribution is the most probable of all "fair" random distributions, in the limit as the probability levels go from discrete to continuous.
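The limit above can be checked numerically: as <math>N</math> grows, <math>\tfrac{1}{N}\log W</math> for the assignment <math>n_i = Np_i</math> approaches the Shannon entropy <math>H(\mathbf{p})</math>. A small sketch, assuming for simplicity that each <math>Np_i</math> is (close to) an integer:

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def avg_log_multiplicity(p, N):
    """(1/N) log W for the assignment n_i = N * p_i, rounded to integers."""
    counts = [round(N * pi) for pi in p]
    log_w = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_w / N

p = [0.5, 0.3, 0.2]
# The gap between (1/N) log W and H(p) shrinks as the quanta get finer:
gaps = [abs(avg_log_multiplicity(p, N) - entropy(p)) for N in (10, 1000, 100000)]
assert gaps[0] > gaps[1] > gaps[2]
```

The residual gap at finite <math>N</math> comes from the lower-order terms of Stirling's approximation, which vanish in the limit.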