==General solution for the maximum entropy distribution with linear constraints==
{{main|Maximum entropy probability distribution}}

===Discrete case===
We have some testable information ''I'' about a quantity ''x'' taking values in {''x<sub>1</sub>'', ''x<sub>2</sub>'',..., ''x<sub>n</sub>''}. We assume this information has the form of ''m'' constraints on the expectations of the functions ''f<sub>k</sub>''; that is, we require our probability distribution to satisfy the moment inequality/equality constraints:

:<math>\sum_{i=1}^n \Pr(x_i)f_k(x_i) \geq F_k \qquad k = 1, \ldots,m.</math>

where the <math> F_k </math> are observables. We also require the probabilities to sum to one, which may be viewed as a primitive constraint on the constant function equal to 1, with an observable equal to 1, giving the constraint

:<math>\sum_{i=1}^n \Pr(x_i) = 1.</math>

The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form:<ref name="BK08"/>

:<math>\Pr(x_i) = \frac{1}{Z(\lambda_1,\ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>

for some <math>\lambda_1,\ldots,\lambda_m</math>. It is sometimes called the [[Gibbs distribution]]. The normalization constant is determined by:

:<math> Z(\lambda_1,\ldots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>

and is conventionally called the [[partition function (mathematics)|partition function]]. (The [[Pitman–Koopman theorem]] states that the necessary and sufficient condition for a sampling distribution to admit [[sufficiency (statistics)|sufficient statistics]] of bounded dimension is that it have the general form of a maximum entropy distribution.)
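
As an illustration only, the following Python sketch evaluates a distribution of this form for given multipliers. The function name <code>gibbs_distribution</code>, its arguments, and the six-sided die example are assumptions introduced here for illustration; they are not taken from the cited source.

```python
import math

def gibbs_distribution(values, features, lambdas):
    """Return (probabilities, Z) for Pr(x_i) = exp(sum_k lambda_k * f_k(x_i)) / Z.

    values   -- the possible outcomes x_1, ..., x_n
    features -- the functions f_1, ..., f_m
    lambdas  -- the multipliers lambda_1, ..., lambda_m
    """
    weights = [
        math.exp(sum(lam * f(x) for lam, f in zip(lambdas, features)))
        for x in values
    ]
    z = sum(weights)  # partition function Z(lambda_1, ..., lambda_m)
    return [w / z for w in weights], z

# Hypothetical example: a six-sided die with the single feature f_1(x) = x.
# With lambda_1 = 0 every weight equals 1, so the result is the uniform distribution.
faces = [1, 2, 3, 4, 5, 6]
probs, z = gibbs_distribution(faces, [lambda x: x], [0.0])
```

Dividing by the sum of the weights enforces the normalization constraint for any choice of multipliers; the moment constraints are what fix the multipliers themselves.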

The λ<sub>k</sub> parameters are Lagrange multipliers. In the case of equality constraints, their values are determined by solving the nonlinear equations

:<math>F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\ldots, \lambda_m).</math>
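
These equations restate the equality constraints. Differentiating the logarithm of the partition function defined above and substituting the form of <math>\Pr(x_i)</math> gives

:<math>\frac{\partial}{\partial \lambda_k} \log Z = \frac{1}{Z}\sum_{i=1}^n f_k(x_i)\exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right] = \sum_{i=1}^n \Pr(x_i)f_k(x_i),</math>

which equals <math>F_k</math> exactly when the ''k''-th constraint holds with equality.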

In the case of inequality constraints, the Lagrange multipliers are determined from the solution of a [[convex optimization]] program with linear constraints.<ref name="BK08"/> In both cases, there is in general no [[closed form solution]], and the computation of the Lagrange multipliers usually requires [[Numerical analysis|numerical methods]].
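
A minimal sketch of one such numerical computation, assuming a single equality constraint, is given below. It applies Newton's method to the equation <math>\partial \log Z / \partial \lambda = F</math>, using the fact that the second derivative of <math>\log Z</math> is the variance of ''f'' under the current distribution. Newton's method is only one possible choice and is not prescribed by the cited source; the function names and the example (a six-sided die with prescribed mean 4.5) are hypothetical.

```python
import math

def gibbs(values, f, lam):
    """Probabilities Pr(x_i) proportional to exp(lam * f(x_i))."""
    weights = [math.exp(lam * f(x)) for x in values]
    z = sum(weights)  # partition function Z(lam)
    return [w / z for w in weights]

def solve_multiplier(values, f, target, tol=1e-12, max_iter=100):
    """Newton iteration for the lam satisfying d/d(lam) log Z(lam) = target.

    Assumes f is not constant on `values` and that `target` lies strictly
    between the smallest and largest value of f, so the variance is positive
    and a finite solution exists.
    """
    lam = 0.0
    for _ in range(max_iter):
        probs = gibbs(values, f, lam)
        # First derivative of log Z: the expectation of f.
        mean = sum(p * f(x) for p, x in zip(probs, values))
        # Second derivative of log Z: the variance of f.
        var = sum(p * (f(x) - mean) ** 2 for p, x in zip(probs, values))
        step = (mean - target) / var
        lam -= step
        if abs(step) < tol:
            break
    return lam

# Hypothetical example: die faces 1..6 constrained to have mean 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam = solve_multiplier(faces, lambda x: x, 4.5)
probs = gibbs(faces, lambda x: x, lam)
```

Because the prescribed mean exceeds the uniform mean of 3.5, the resulting multiplier is positive and the probabilities increase with the face value. With several constraints the same idea applies, with the scalar variance replaced by the covariance matrix of the functions ''f<sub>k</sub>''.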

===Continuous case===
For [[continuous distribution]]s, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead, [[E. T. Jaynes|Edwin Jaynes]] (1963, 1968, 2003) gave the following formula, which is closely related to the [[relative entropy]] (see also [[differential entropy]]):

:<math>H_c=-\int p(x)\log\frac{p(x)}{q(x)}\,dx</math>

where ''q''(''x''), which Jaynes called the "invariant measure", is proportional to the [[limiting density of discrete points]]. For now, we shall assume that ''q'' is known; we will discuss it further after the solution equations are given.

A closely related quantity, the relative entropy, is usually defined as the [[Kullback–Leibler divergence]] of ''p'' from ''q'' (although it is sometimes, confusingly, defined as the negative of this).  The inference principle of minimizing this, due to Kullback, is known as the [[Kullback–Leibler divergence#Principle of minimum discrimination information|Principle of Minimum Discrimination Information]].

We have some testable information ''I'' about a quantity ''x'' which takes values in some [[interval (mathematics)|interval]] of the [[real numbers]] (all integrals below are over this interval). We assume this information has the form of ''m'' constraints on the expectations of the functions ''f<sub>k</sub>'', i.e. we require our probability density function to satisfy the inequality (or purely equality) moment constraints:

:<math>\int p(x)f_k(x)\,dx \geq F_k \qquad k = 1, \dotsc,m.</math>

where the <math> F_k </math> are observables. We also require the probability density to integrate to one, which may be viewed as a primitive constraint on the constant function equal to 1, with an observable equal to 1, giving the constraint

:<math>\int p(x)\,dx = 1.</math>

The probability density function with maximum ''H<sub>c</sub>'' subject to these constraints is:<ref name="BK11"/>

:<math>p(x) = \frac{1}{Z(\lambda_1,\dotsc, \lambda_m)} q(x)\exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]</math>

with the [[partition function (mathematics)|partition function]] determined by

:<math> Z(\lambda_1,\dotsc, \lambda_m) = \int q(x)\exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]\,dx.</math>

As in the discrete case, when all moment constraints are equalities, the values of the <math>\lambda_k</math> parameters are determined by the system of nonlinear equations:

:<math>F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\dotsc, \lambda_m).</math>
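
As a worked illustration under assumptions chosen here for simplicity (they are not taken from the cited source), let ''x'' range over <math>[0, \infty)</math>, take <math>q(x) = 1</math>, and impose the single equality constraint <math>f_1(x) = x</math> with <math>F_1 = \mu > 0</math>. The partition function is finite only for <math>\lambda_1 < 0</math>:

:<math>Z(\lambda_1) = \int_0^\infty \exp\left[\lambda_1 x\right]\,dx = -\frac{1}{\lambda_1}, \qquad \frac{\partial}{\partial \lambda_1} \log Z(\lambda_1) = -\frac{1}{\lambda_1} = \mu.</math>

Hence <math>\lambda_1 = -1/\mu</math>, and the maximum entropy density is the [[exponential distribution]]

:<math>p(x) = \frac{1}{\mu}\exp\left[-\frac{x}{\mu}\right], \qquad x \geq 0.</math>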

In the case of inequality moment constraints, the Lagrange multipliers are determined from the solution of a [[convex optimization]] program.<ref name="BK11"/>

The invariant measure function ''q''(''x'') is best understood by supposing that ''x'' is known to take values only in the [[bounded interval]] (''a'', ''b''), and that no other information is given. Then the maximum entropy probability density function is

:<math> p(x) = A \cdot q(x), \qquad a < x < b</math>

where ''A'' is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'.  It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the [[principle of transformation groups]] or [[Marginalization (probability)|marginalization theory]].

===Examples===
For several examples of maximum entropy distributions, see the article on [[maximum entropy probability distribution]]s.