==Justifications for the principle of maximum entropy==
Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments. These arguments take the use of [[Bayesian probability]] as given, and are thus subject to the same postulates.

===Information entropy as a measure of 'uninformativeness'===
Consider a '''discrete probability distribution''' among <math>m</math> mutually exclusive [[proposition]]s. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and the information entropy would then be equal to its maximum possible value, <math>\log m</math>. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is, ranging from zero (completely informative) to <math>\log m</math> (completely uninformative).

By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess. Thus the maximum entropy distribution is the only reasonable distribution. The [http://projecteuclid.org/euclid.ba/1340370710 dependence of the solution] on the dominating measure represented by <math>m(x)</math> is, however, a source of criticism of the approach, since this dominating measure is in fact arbitrary.<ref name=Druihlet2007/>
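For example, for <math>m = 3</math> propositions (using natural logarithms), the fully informative assignment <math>(1, 0, 0)</math> has entropy <math>0</math>, an intermediate assignment such as <math>(\tfrac12, \tfrac14, \tfrac14)</math> has entropy <math>\tfrac32 \log 2 \approx 1.04</math>, and the uniform assignment <math>(\tfrac13, \tfrac13, \tfrac13)</math> attains the maximum value <math>\log 3 \approx 1.10</math>.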
===The Wallis derivation===
The following argument is the result of a suggestion made by [[Graham Wallis]] to E. T. Jaynes in 1962.<ref name=Jaynes2003/> It is essentially the same mathematical argument used for the [[Maxwell–Boltzmann statistics]] in [[statistical mechanics]], although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed ''a priori'', but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make a probability assignment among <math>m</math> [[mutually exclusive]] propositions. They have some testable information, but are not sure how to go about including this information in their probability assessment. They therefore conceive of the following random experiment. They will distribute <math>N</math> quanta of probability (each worth <math>1/N</math>) at random among the <math>m</math> possibilities. (One might imagine that they will throw <math>N</math> balls into <math>m</math> buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, they will check if the probability assignment thus obtained is consistent with their information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures.) If it is inconsistent, they will reject it and try again. If it is consistent, their assessment will be

:<math>p_i = \frac{n_i}{N}</math>

where <math>p_i</math> is the probability of the <math>i</math><sup>th</sup> proposition, while <math>n_i</math> is the number of quanta that were assigned to the <math>i</math><sup>th</sup> proposition (i.e. the number of balls that ended up in bucket <math>i</math>).

Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the [[multinomial distribution]],

:<math>Pr(\mathbf{p}) = W \cdot m^{-N}</math>

where

:<math>W = \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!}</math>

is sometimes known as the multiplicity of the outcome.

The most probable result is the one which maximizes the multiplicity <math>W</math>. Rather than maximizing <math>W</math> directly, the protagonist could equivalently maximize any monotonic increasing function of <math>W</math>. They decide to maximize

:<math>\begin{align}
\frac 1 N \log W &= \frac 1 N \log \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!} \\[6pt]
&= \frac 1 N \log \frac{N!}{(Np_1)! \, (Np_2)! \, \dotsb \, (Np_m)!} \\[6pt]
&= \frac 1 N \left( \log N! - \sum_{i=1}^m \log ((Np_i)!) \right).
\end{align}</math>

At this point, in order to simplify the expression, the protagonist takes the limit as <math>N\to\infty</math>, i.e. as the probability levels go from grainy discrete values to smooth continuous values. Using [[Stirling's approximation]], they find

:<math>\begin{align}
\lim_{N \to \infty}\left(\frac{1}{N}\log W\right) &= \frac 1 N \left( N\log N - \sum_{i=1}^m Np_i\log (Np_i) \right) \\[6pt]
&= \log N - \sum_{i=1}^m p_i\log (Np_i) \\[6pt]
&= \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \\[6pt]
&= \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \\[6pt]
&= - \sum_{i=1}^m p_i\log p_i \\[6pt]
&= H(\mathbf{p}).
\end{align}</math>

All that remains for the protagonist to do is to maximize entropy under the constraints of their testable information. They have found that the maximum entropy distribution is the most probable of all "fair" random distributions, in the limit as the probability levels go from discrete to continuous.
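The convergence used in the last step can be checked numerically. The short script below is an illustrative sketch only (the function names and the particular assignment <math>\mathbf{p}</math> are chosen for the example): it evaluates <math>\tfrac{1}{N}\log W</math> for a fixed assignment and increasing <math>N</math>, and the values approach <math>H(\mathbf{p})</math>.

<syntaxhighlight lang="python">
# Illustrative check that (1/N) log W approaches H(p) as N grows,
# where W = N! / (n_1! ... n_m!) and n_i = N p_i.
from math import lgamma, log

def log_multiplicity(counts):
    """log W = log N! - sum_i log(n_i!), computed via log-gamma to avoid overflow."""
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(q * log(q) for q in p if q > 0)

p = [0.5, 0.25, 0.25]                       # a fixed assignment over m = 3 propositions
for N in (100, 1_000, 10_000, 100_000):     # N chosen so that every n_i = N p_i is an integer
    counts = [round(N * q) for q in p]
    print(N, log_multiplicity(counts) / N)  # tends towards H(p)

print("H(p) =", entropy(p))                 # about 1.0397 nats
</syntaxhighlight>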
===Compatibility with Bayes' theorem===
Giffin and Caticha (2007) state that [[Bayes' theorem]] and the principle of maximum entropy are completely compatible and can be seen as special cases of the "method of maximum relative entropy". They state that this method reproduces every aspect of orthodox Bayesian inference methods. In addition, this new method opens the door to tackling problems that could not be addressed by either the maximal entropy principle or orthodox Bayesian methods individually. Moreover, recent contributions (Lazar 2003, and Schennach 2005) show that frequentist relative-entropy-based inference approaches (such as [[empirical likelihood]] and [[exponentially tilted empirical likelihood]] – see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.

Jaynes stated that Bayes' theorem was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution.<ref name=Jaynes1988/> It is, however, possible in principle to solve for a posterior distribution directly from a stated prior distribution using the [[Cross-entropy|principle of minimum cross-entropy]] (the principle of maximum entropy being the special case in which a [[uniform distribution (discrete)|uniform distribution]] is the given prior), independently of any Bayesian considerations, by treating the problem formally as a constrained optimisation problem with the entropy functional as the objective function. For the case of given average values as testable information (averaged over the sought-after probability distribution), the sought-after distribution is formally the [[Gibbs measure|Gibbs (or Boltzmann) distribution]], whose parameters must be solved for in order to achieve minimum cross-entropy and satisfy the given testable information.
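As a concrete sketch of the last point (an illustration under assumed inputs, not a prescription from the cited sources): for a finite set of values <math>x_i</math> with a uniform prior and a single constraint on the average, <math>\textstyle\sum_i p_i x_i = \mu</math>, the maximum entropy distribution has the Gibbs form <math>p_i \propto e^{-\lambda x_i}</math>, and the multiplier <math>\lambda</math> can be found by a one-dimensional search such as bisection.

<syntaxhighlight lang="python">
# Sketch: maximum entropy over values x_i subject to a given mean mu gives
# p_i proportional to exp(-lambda * x_i); lambda is found by bisection.
from math import exp

def gibbs(xs, lam):
    """Gibbs/Boltzmann distribution p_i = exp(-lam * x_i) / Z over the values xs."""
    weights = [exp(-lam * x) for x in xs]
    Z = sum(weights)                       # partition function
    return [w / Z for w in weights]

def mean(xs, p):
    return sum(x * q for x, q in zip(xs, p))

def solve_lambda(xs, mu, lo=-50.0, hi=50.0, steps=200):
    """Bisection on lambda; the mean is a decreasing function of lambda."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if mean(xs, gibbs(xs, mid)) > mu:
            lo = mid                       # mean too high: increase lambda
        else:
            hi = mid                       # mean too low: decrease lambda
    return (lo + hi) / 2

xs = [1, 2, 3, 4, 5, 6]    # e.g. the faces of a die
mu = 4.5                   # the testable information: a prescribed average value
lam = solve_lambda(xs, mu)
p = gibbs(xs, lam)
print(p)                   # maximum entropy assignment, of Gibbs form
print(mean(xs, p))         # matches mu
</syntaxhighlight>

With <math>\mu = 4.5</math>, larger than the uniform average of 3.5, the computed <math>\lambda</math> is negative and the resulting assignment tilts towards the larger values.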