====Jeffreys' prior probability (Beta(1/2,1/2) for a Bernoulli or for a binomial distribution)====
{{Main|Jeffreys prior}}
[[File:Jeffreys prior probability for the beta distribution - J. Rodal.png|thumb|[[Jeffreys prior]] probability for the beta distribution: the square root of the determinant of [[Fisher's information]] matrix: <math>\scriptstyle\sqrt{\det(\mathcal{I}(\alpha, \beta))} = \sqrt{\psi_1(\alpha)\psi_1(\beta)-( \psi_1(\alpha)+\psi_1(\beta) )\psi_1(\alpha + \beta)}</math> is a function of the [[trigamma function]] ψ<sub>1</sub> of shape parameters α, β]]
[[File:Beta distribution for 3 different prior probability functions - J. Rodal.png|thumb|Posterior Beta densities with samples having success = ''s'' and failure = ''f'', with ''s''/(''s'' + ''f'') = 1/2 and ''s'' + ''f'' ∈ {3,10,50}, based on three different prior probability functions: Haldane (Beta(0,0)), Jeffreys (Beta(1/2,1/2)), and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near ''p'' = 1/2). Significant differences appear for very small sample sizes (the flatter distribution for sample size of 3)]]
[[File:Beta distribution for 3 different prior probability functions, skewed case - J. Rodal.png|thumb|Posterior Beta densities with samples having success = ''s'' and failure = ''f'', with ''s''/(''s'' + ''f'') = 1/4 and ''s'' + ''f'' ∈ {3,10,50}, based on three different prior probability functions: Haldane (Beta(0,0)), Jeffreys (Beta(1/2,1/2)), and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near ''p'' = 1/4). Significant differences appear for very small sample sizes (the very skewed distribution for the degenerate case of sample size = 3; in this degenerate and unlikely case the Haldane prior results in a reverse "J" shape with mode at ''p'' = 0 instead of ''p'' = 1/4). If there is sufficient [[Sample (statistics)|sampling data]], the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar [[posterior probability|''posterior'' probability]] densities.]]
[[File:Beta distribution for 3 different prior probability functions, skewed case sample size = (4,12,40) - J. Rodal.png|thumb|Posterior Beta densities with samples having success = ''s'' and failure = ''f'', with ''s''/(''s'' + ''f'') = 1/4 and ''s'' + ''f'' ∈ {4,12,40}, based on three different prior probability functions: Haldane (Beta(0,0)), Jeffreys (Beta(1/2,1/2)), and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 40 (with more pronounced peak near ''p'' = 1/4). Significant differences appear for very small sample sizes]]

[[Harold Jeffreys]]<ref name=Jeffreys>{{cite book|last=Jeffreys|first=Harold|title=Theory of Probability|year=1998|edition=3rd|publisher=Oxford University Press|isbn=978-0198503682}}</ref><ref name=JeffreysPRIOR>{{cite journal|last=Jeffreys|first=Harold|title=An Invariant Form for the Prior Probability in Estimation Problems|journal=Proceedings of the Royal Society|date=September 1946|volume=186|series=A 24|issue=1007|pages=453–461|doi=10.1098/rspa.1946.0056|pmid=20998741|bibcode=1946RSPSA.186..453J|doi-access=free}}</ref> proposed to use an [[uninformative prior]] probability measure that should be [[Parametrization invariance|invariant under reparameterization]]: proportional to the square root of the [[determinant]] of [[Fisher's information]] matrix.

For the [[Bernoulli distribution]], this can be shown as follows: for a coin that is "heads" with probability ''p'' ∈ [0, 1] and is "tails" with probability 1 − ''p'', for a given (H,T) ∈ {(0,1), (1,0)} the probability is ''p<sup>H</sup>''(1 − ''p'')<sup>''T''</sup>. Since ''T'' = 1 − ''H'', the [[Bernoulli distribution]] is ''p<sup>H</sup>''(1 − ''p'')<sup>1 − ''H''</sup>. Considering ''p'' as the only parameter, it follows that the log likelihood for the Bernoulli distribution is

:<math>\ln \mathcal{L} (p\mid H) = H \ln(p)+ (1-H) \ln(1-p).</math>

The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: ''p''), therefore:

:<math>\begin{align}
\sqrt{\mathcal{I}(p)} &= \sqrt{\operatorname{E}\!\left[ \left( \frac{d}{dp} \ln(\mathcal{L} (p\mid H)) \right)^2\right]} \\[6pt]
&= \sqrt{\operatorname{E}\!\left[ \left( \frac{H}{p} - \frac{1-H}{1-p}\right)^2 \right]} \\[6pt]
&= \sqrt{p^1 (1-p)^0 \left( \frac{1}{p} - \frac{0}{1-p}\right)^2 + p^0 (1-p)^1 \left(\frac{0}{p} - \frac{1}{1-p}\right)^2} \\
&= \frac{1}{\sqrt{p(1-p)}}.
\end{align}</math>

Similarly, for the [[binomial distribution]] with ''n'' [[Bernoulli trials]], it can be shown that

:<math>\sqrt{\mathcal{I}(p)}= \frac{\sqrt{n}}{\sqrt{p(1-p)}}.</math>

Thus, for the [[Bernoulli distribution|Bernoulli]] and [[binomial distribution]]s, [[Jeffreys prior]] is proportional to <math>\scriptstyle \frac{1}{\sqrt{p(1-p)}}</math>, which happens to be proportional to a beta distribution with domain variable ''x'' = ''p'' and shape parameters α = β = 1/2, the [[arcsine distribution]]:

:<math>\operatorname{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) = \frac{1}{\pi \sqrt{p(1-p)}}.</math>

It will be shown in the next section that the normalizing constant for Jeffreys prior is immaterial to the final result because the normalizing constant cancels out in Bayes' theorem for the posterior probability. Hence Beta(1/2,1/2) is used as the Jeffreys prior for both Bernoulli and binomial distributions. As shown in the next section, when using this expression as a prior probability times the likelihood in [[Bayes' theorem]], the posterior probability turns out to be a beta distribution. It is important to realize, however, that Jeffreys prior is proportional to <math>\scriptstyle \frac{1}{\sqrt{p(1-p)}}</math> for the Bernoulli and binomial distributions, but not for the beta distribution.
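This scalar Fisher information can be checked numerically. The following minimal sketch (an illustration assuming Python with NumPy and SciPy; the function name <code>bernoulli_fisher_info</code> is not from any cited source) evaluates the defining expectation at a sample point and compares the normalized Jeffreys prior against the Beta(1/2,1/2) density:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import beta

def bernoulli_fisher_info(p):
    """Fisher information I(p) = E[(d/dp ln L(p|H))^2] for H in {0, 1}."""
    score_heads = 1.0 / p            # d/dp ln L(p|H) when H = 1
    score_tails = -1.0 / (1.0 - p)   # d/dp ln L(p|H) when H = 0
    return p * score_heads**2 + (1.0 - p) * score_tails**2

p = 0.3
print(bernoulli_fisher_info(p))    # 4.7619..., matching 1/(p(1-p))
print(1.0 / (p * (1.0 - p)))       # 4.7619...

# Normalizing sqrt(I(p)) by pi recovers the Beta(1/2,1/2) density:
print(np.sqrt(bernoulli_fisher_info(p)) / np.pi)  # 0.6946...
print(beta.pdf(p, 0.5, 0.5))                      # 0.6946...
</syntaxhighlight>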
Jeffreys prior for the beta distribution is given by the square root of the determinant of the Fisher information matrix for the beta distribution, which, as shown in {{section link||Fisher information matrix}}, is a function of the [[trigamma function]] ψ<sub>1</sub> of shape parameters α and β as follows:

:<math>\begin{align}
\sqrt{\det(\mathcal{I}(\alpha, \beta))} &= \sqrt{\psi_1(\alpha)\psi_1(\beta)-(\psi_1(\alpha)+\psi_1(\beta))\psi_1(\alpha + \beta)} \\
\lim_{\alpha\to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = \infty\\
\lim_{\alpha\to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = 0
\end{align}</math>

As previously discussed, Jeffreys prior for the Bernoulli and binomial distributions is proportional to the [[arcsine distribution]] Beta(1/2,1/2), a one-dimensional ''curve'' that looks like a basin as a function of the parameter ''p'' of the Bernoulli and binomial distributions. The walls of the basin are formed by ''p'' approaching the singularities at the ends ''p'' → 0 and ''p'' → 1, where Beta(1/2,1/2) approaches infinity. Jeffreys prior for the beta distribution, by contrast, is a ''two-dimensional surface'' (embedded in a three-dimensional space) that looks like a basin with only two of its walls meeting at the corner α = β = 0 (and missing the other two walls), as a function of the shape parameters α and β of the beta distribution. The two adjoining walls of this two-dimensional surface are formed by the shape parameters α and β approaching the singularities (of the trigamma function) at α, β → 0. It has no walls for α, β → ∞ because in this case the determinant of Fisher's information matrix for the beta distribution approaches zero. It will be shown in the next section that Jeffreys prior probability results in posterior probabilities (when multiplied by the binomial likelihood function) that are intermediate between the posterior probability results of the Haldane and Bayes prior probabilities.
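The shape of this surface can be explored numerically, since SciPy exposes the trigamma function as <code>polygamma(1, x)</code>. The sketch below (an illustration assuming NumPy and SciPy; the function names are not from any cited source) evaluates <math>\scriptstyle\sqrt{\det(\mathcal{I}(\alpha, \beta))}</math> and exhibits the two singular walls near α, β → 0 and the decay toward zero for large shape parameters:

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import polygamma

def trigamma(x):
    """Trigamma function psi_1(x), the second derivative of ln Gamma(x)."""
    return polygamma(1, x)

def jeffreys_beta_surface(a, b):
    """sqrt(det I(alpha, beta)): the unnormalized Jeffreys prior density
    over the shape parameters of the beta distribution."""
    return np.sqrt(trigamma(a) * trigamma(b)
                   - (trigamma(a) + trigamma(b)) * trigamma(a + b))

print(jeffreys_beta_surface(0.001, 1.0))    # ~49: large near the wall alpha -> 0
print(jeffreys_beta_surface(1.0, 1.0))      # ~0.76: moderate in the interior
print(jeffreys_beta_surface(100.0, 100.0))  # ~5e-4: flattens toward 0 for large alpha, beta
</syntaxhighlight>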
Jeffreys prior may be difficult to obtain analytically, and in some cases it does not exist (even for simple distribution functions like the asymmetric [[triangular distribution]]). Berger, Bernardo and Sun, in a 2009 paper,<ref name="BergerBernardoSun">{{cite journal|last=Berger|first=James |author2=Bernardo, Jose |author3=Sun, Dongchu|title=The formal definition of reference priors|journal=The Annals of Statistics|year=2009|volume=37|issue=2|pages=905–938|doi=10.1214/07-AOS587|url= http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.aos/1236693154|arxiv=0904.0156|bibcode=2009arXiv0904.0156B |s2cid=3221355 }}</ref> defined a reference prior probability distribution that (unlike Jeffreys prior) exists for the asymmetric [[triangular distribution]]. They could not obtain a closed-form expression for their reference prior, but numerical calculations show it to be nearly perfectly fitted by the (proper) prior

:<math>\operatorname{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) \propto \frac{1}{\sqrt{\theta(1-\theta)}}</math>

where θ is the vertex variable for the asymmetric triangular distribution with support [0, 1] (corresponding to the following parameter values in Wikipedia's article on the [[triangular distribution]]: vertex ''c'' = ''θ'', left end ''a'' = 0, and right end ''b'' = 1). Berger et al. also give a heuristic argument that Beta(1/2,1/2) could indeed be the exact Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution. Therefore, Beta(1/2,1/2) is not only the Jeffreys prior for the Bernoulli and binomial distributions, but also seems to be the Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution (for which the Jeffreys prior does not exist), a distribution used in project management and [[PERT]] analysis to describe the cost and duration of project tasks.

Clarke and Barron<ref>{{cite journal|last=Clarke|first=Bertrand S.|author2=Andrew R. Barron|title=Jeffreys' prior is asymptotically least favorable under entropy risk|journal=Journal of Statistical Planning and Inference|year=1994|volume=41|pages=37–60|url=http://www.stat.yale.edu/~arb4/publications_files/jeffery's%20prior.pdf|doi=10.1016/0378-3758(94)90153-8}}</ref> prove that, among continuous positive priors, Jeffreys prior (when it exists) asymptotically maximizes Shannon's [[mutual information]] between a sample of size ''n'' and the parameter, and therefore ''Jeffreys prior is the most uninformative prior'' (measuring information as Shannon information). The proof rests on an examination of the [[Kullback–Leibler divergence]] between probability density functions for [[iid]] random variables.
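Because the beta distribution is conjugate to the binomial likelihood, a Beta(α<sub>0</sub>, β<sub>0</sub>) prior updated with ''s'' successes and ''f'' failures yields a Beta(α<sub>0</sub> + ''s'', β<sub>0</sub> + ''f'') posterior. A minimal sketch in plain Python (an illustration, not drawn from the cited sources) compares posterior means under the Haldane, Jeffreys, and Bayes priors for the ''s''/(''s'' + ''f'') = 1/4 case of the figures above:

<syntaxhighlight lang="python">
# Conjugate update: a Beta(a0, b0) prior with s successes and f failures
# gives a Beta(a0 + s, b0 + f) posterior, with mean (a0 + s)/(a0 + b0 + s + f).
priors = {
    "Haldane":  (0.0, 0.0),   # improper Beta(0,0)
    "Jeffreys": (0.5, 0.5),
    "Bayes":    (1.0, 1.0),
}

for s, f in [(1, 3), (10, 30)]:   # same ratio s/(s+f) = 1/4, growing sample
    for name, (a0, b0) in priors.items():
        mean = (a0 + s) / (a0 + b0 + s + f)
        print(f"s={s:2d} f={f:2d} {name:8s} posterior mean = {mean:.4f}")
</syntaxhighlight>

For the small sample (''s'' = 1, ''f'' = 3) the three posterior means differ noticeably (0.250, 0.300, and 0.333, with the Jeffreys result between the Haldane and Bayes results), while for the larger sample (''s'' = 10, ''f'' = 30) they nearly coincide (0.250, 0.256, 0.262), consistent with the figures above.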