====Effect of different prior probability choices on the posterior beta distribution====
If samples drawn from the population of a random variable ''X'' result in ''s'' successes and ''f'' failures in ''n''&nbsp;=&nbsp;''s''&nbsp;+&nbsp;''f'' [[Bernoulli trial]]s, then the [[likelihood function]] for the observed ''s'' and ''f'' given ''x''&nbsp;=&nbsp;''p'' (the notation ''x''&nbsp;=&nbsp;''p'' in the expressions below emphasizes that the variable ''x'' stands for the value of the parameter ''p'' in the binomial distribution) is the following [[binomial distribution]]:

:<math>\mathcal{L}(s,f\mid x=p) = {s+f \choose s} x^s(1-x)^f = {n \choose s} x^s(1-x)^{n - s}. </math>

If beliefs about [[prior probability]] information are reasonably well approximated by a beta distribution with parameters ''α''&nbsp;Prior and ''β''&nbsp;Prior, then:

:<math>\text{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) = \frac{ x^{\alpha \operatorname{prior}-1}(1-x)^{\beta \operatorname{prior}-1}}{\Beta(\alpha \operatorname{prior},\beta \operatorname{prior})}.</math>

According to [[Bayes' theorem]] for a continuous event space, the [[posterior probability]] density is given by the product of the [[prior probability]] and the likelihood function (given the evidence ''s'' and ''f''&nbsp;=&nbsp;''n''&nbsp;−&nbsp;''s''), normalized so that the area under the curve equals one, as follows:

:<math>\begin{align}
& \text{posterior probability density}(x=p\mid s,n-s) \\[6pt]
= {} & \frac{\text{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p)} {\int_0^1\text{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p) \, dx} \\[6pt]
= {} & \frac{{{n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} / \Beta(\alpha \operatorname{prior},\beta \operatorname{prior})}}{\int_0^1 \left({n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} /\Beta(\alpha \operatorname{prior}, \beta \operatorname{prior})\right) \, dx} \\[6pt]
= {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\int_0^1 \left(x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}\right) \, dx} \\[6pt]
= {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\Beta(s+\alpha \operatorname{prior},n-s+\beta \operatorname{prior})}.
\end{align}</math>
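
The cancellation in this derivation can be checked numerically. The following Python sketch is an illustration only, not part of the cited sources: the function names, the midpoint-rule integration, the sample values ''s''&nbsp;=&nbsp;3, ''n''&nbsp;=&nbsp;10 and the evaluation point ''x''&nbsp;=&nbsp;0.3 are all assumptions made for the example. It normalizes the product of the un-normalized prior and likelihood by direct integration and compares the result with the closed-form Beta(''s''&nbsp;+&nbsp;''α''&nbsp;prior, ''n''&nbsp;−&nbsp;''s''&nbsp;+&nbsp;''β''&nbsp;prior) density.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point 0 < x < 1."""
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta(a, b))

def numeric_posterior(x, s, n, a_prior, b_prior, steps=20000):
    """Posterior density at x, normalizing prior * likelihood by the midpoint rule.

    The binomial coefficient and B(a_prior, b_prior) are omitted on purpose:
    they would cancel between numerator and denominator.
    """
    def unnormalized(t):
        return t ** (s + a_prior - 1) * (1 - t) ** (n - s + b_prior - 1)

    h = 1.0 / steps
    integral = h * sum(unnormalized((i + 0.5) * h) for i in range(steps))
    return unnormalized(x) / integral

# Hypothetical data: s = 3 successes in n = 10 trials, evaluated at x = 0.3.
s, n, x = 3, 10, 0.3
for a_prior, b_prior in [(1.0, 1.0), (0.5, 0.5)]:
    closed_form = beta_pdf(x, s + a_prior, n - s + b_prior)
    numeric = numeric_posterior(x, s, n, a_prior, b_prior)
    print(a_prior, b_prior, closed_form, numeric)
```

The two printed densities for each prior should agree to within the integration error, which is what the closed form in the last line of the derivation asserts.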

The [[binomial coefficient]]

:<math>{s+f \choose s}={n \choose s}=\frac{(s+f)!}{s! f!}=\frac{n!}{s!(n-s)!}</math>

appears in both the numerator and the denominator of the posterior probability and does not depend on the integration variable ''x''; hence it cancels out and is irrelevant to the final result.  Similarly, the normalizing factor for the prior probability, the beta function B(''α''Prior,''β''Prior), cancels out and is immaterial to the final result. The same posterior probability result can be obtained if one uses an un-normalized prior

:<math>x^{\alpha \operatorname{prior}-1}(1-x)^{\beta \operatorname{prior}-1}</math>

because the normalizing factors all cancel out. Several authors (including Jeffreys himself) therefore use an un-normalized prior formula.  The numerator of the posterior probability ends up being just the (un-normalized) product of the prior probability and the likelihood function, and the denominator is its integral from zero to one. The beta function in the denominator, B(''s''&nbsp;+&nbsp;''α''&nbsp;Prior,&nbsp;''n''&nbsp;−&nbsp;''s''&nbsp;+&nbsp;''β''&nbsp;Prior), appears as a normalization constant that ensures the posterior probability density integrates to unity.

The ratio ''s''/''n'' of the number of successes to the total number of trials is a [[sufficient statistic]] in the binomial case, which is relevant for the following results.

For the '''Bayes'''' prior probability (Beta(1,1)), the posterior probability is:

:<math>\text{posterior probability}(p=x\mid s,f) = \frac{x^{s}(1-x)^{n-s}}{\Beta(s+1,n-s+1)}, \text{ with mean }=\frac{s+1}{n+2},\text{ (and mode}=\frac{s}{n}\text{ if } 0 < s < n).</math>

For the '''Jeffreys'''' prior probability (Beta(1/2,1/2)), the posterior probability is:

:<math>\text{posterior probability}(p=x\mid s,f) = {x^{s-\tfrac{1}{2}}(1-x)^{n-s-\frac{1}{2}} \over \Beta(s+\tfrac{1}{2},n-s+\tfrac{1}{2})} ,\text{ with mean} = \frac{s+\tfrac{1}{2}}{n+1},\text{ (and mode}=\frac{s-\tfrac{1}{2}}{n-1}\text{ if } \tfrac{1}{2} < s < n-\tfrac{1}{2}).</math>

and for the '''Haldane''' prior probability (Beta(0,0)), the posterior probability is:

:<math>\text{posterior probability}(p=x\mid s,f) = \frac{x^{s-1}(1-x)^{n-s-1}}{\Beta(s,n-s)}, \text{ with mean} = \frac{s}{n},\text{ (and mode}=\frac{s-1}{n-2}\text{ if } 1 < s < n -1).</math>

From the above expressions it follows that for ''s''/''n''&nbsp;=&nbsp;1/2 all three prior probabilities result in the identical location for the posterior probability: mean&nbsp;=&nbsp;mode&nbsp;=&nbsp;1/2.  For ''s''/''n''&nbsp;<&nbsp;1/2, the posterior means are ordered as follows: mean for Bayes prior >&nbsp;mean for Jeffreys prior >&nbsp;mean for Haldane prior. For ''s''/''n''&nbsp;>&nbsp;1/2 the order of these inequalities is reversed, so that the Haldane prior probability results in the largest posterior mean. The ''Haldane'' prior probability Beta(0,0) results in a posterior probability density with ''mean'' (the expected value for the probability of success in the "next" trial) identical to the ratio ''s''/''n'' of the number of successes to the total number of trials. Therefore, the Haldane prior results in a posterior probability with expected value in the next trial equal to the maximum likelihood estimate. The ''Bayes'' prior probability Beta(1,1) results in a posterior probability density with ''mode'' identical to the ratio ''s''/''n'' (the maximum likelihood estimate).
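
The means and modes of the three posteriors can be tabulated with exact rational arithmetic. The sketch below is illustrative only: the helper names and the sample data (''s''&nbsp;=&nbsp;3 and ''s''&nbsp;=&nbsp;5 successes in ''n''&nbsp;=&nbsp;10 trials) are assumptions chosen for the example. It uses the standard Beta(''a'',''b'') formulas mean&nbsp;=&nbsp;''a''/(''a''&nbsp;+&nbsp;''b'') and mode&nbsp;=&nbsp;(''a''&nbsp;−&nbsp;1)/(''a''&nbsp;+&nbsp;''b''&nbsp;−&nbsp;2), the latter valid only when ''a''&nbsp;>&nbsp;1 and ''b''&nbsp;>&nbsp;1.

```python
from fractions import Fraction

# Prior parameters (alpha_prior, beta_prior) for the three priors discussed.
PRIORS = {
    "Bayes": (Fraction(1), Fraction(1)),
    "Jeffreys": (Fraction(1, 2), Fraction(1, 2)),
    "Haldane": (Fraction(0), Fraction(0)),
}

def posterior_params(s, n, a_prior, b_prior):
    """Parameters of the posterior Beta(s + a_prior, n - s + b_prior)."""
    return Fraction(s) + Fraction(a_prior), Fraction(n - s) + Fraction(b_prior)

def posterior_mean(s, n, a_prior, b_prior):
    a, b = posterior_params(s, n, a_prior, b_prior)
    return a / (a + b)

def posterior_mode(s, n, a_prior, b_prior):
    """Interior mode, or None when it does not exist (a <= 1 or b <= 1)."""
    a, b = posterior_params(s, n, a_prior, b_prior)
    if a > 1 and b > 1:
        return (a - 1) / (a + b - 2)
    return None

# Hypothetical data: a skewed case (s/n < 1/2) and the symmetric case (s/n = 1/2).
for s, n in [(3, 10), (5, 10)]:
    for name, (a0, b0) in PRIORS.items():
        print(s, n, name, posterior_mean(s, n, a0, b0), posterior_mode(s, n, a0, b0))
```

For ''s''&nbsp;=&nbsp;3, ''n''&nbsp;=&nbsp;10 the means come out as 1/3, 7/22 and 3/10 for the Bayes, Jeffreys and Haldane priors respectively, which matches the ordering stated for ''s''/''n''&nbsp;<&nbsp;1/2.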

In the case that 100% of the trials have been successful (''s''&nbsp;=&nbsp;''n''), the ''Bayes'' prior probability Beta(1,1) results in a posterior expected value equal to the rule of succession (''n''&nbsp;+&nbsp;1)/(''n''&nbsp;+&nbsp;2), while the Haldane prior Beta(0,0) results in a posterior expected value of 1 (absolute certainty of success in the next trial).  The Jeffreys prior probability results in a posterior expected value equal to (''n''&nbsp;+&nbsp;1/2)/(''n''&nbsp;+&nbsp;1). Perks<ref name=Perks/> (p.&nbsp;303) points out: "This provides a new rule of succession and expresses a 'reasonable' position to take up, namely, that after an unbroken run of n successes we assume a probability for the next trial equivalent to the assumption that we are about half-way through an average run, i.e. that we expect a failure once in (2''n''&nbsp;+&nbsp;2) trials. The Bayes–Laplace rule implies that we are about at the end of an average run or that we expect a failure once in (''n''&nbsp;+&nbsp;2) trials. The comparison clearly favours the new result (what is now called Jeffreys prior) from the point of view of 'reasonableness'."

Conversely, in the case that 100% of the trials have resulted in failure (''s''&nbsp;=&nbsp;0), the ''Bayes'' prior probability Beta(1,1) results in a posterior expected value for success in the next trial equal to 1/(''n''&nbsp;+&nbsp;2), while the Haldane prior Beta(0,0) results in a posterior expected value of success in the next trial of 0 (absolute certainty of failure in the next trial). Jeffreys prior probability results in a posterior expected value for success in the next trial equal to (1/2)/(''n''&nbsp;+&nbsp;1), which Perks<ref name=Perks/> (p.&nbsp;303) points out: "is a much more reasonably remote result than the Bayes–Laplace result&nbsp;1/(''n''&nbsp;+&nbsp;2)".

Jaynes<ref name=Jaynes/> questions (for the uniform prior Beta(1,1)) the use of these formulas for the cases ''s''&nbsp;=&nbsp;0 or ''s''&nbsp;=&nbsp;''n'' because the integrals do not converge (Beta(1,1) is an improper prior for ''s''&nbsp;=&nbsp;0 or ''s''&nbsp;=&nbsp;''n''). In practice, the condition 0&nbsp;<&nbsp;''s''&nbsp;<&nbsp;''n'', necessary for a mode to exist between both ends of the domain, is usually met, in which case the Bayes prior results in a posterior mode located between both ends of the domain.

As remarked in the section on the rule of succession, K. Pearson showed that after ''n'' successes in ''n'' trials the posterior probability (based on the Bayes Beta(1,1) distribution as the prior probability) that the next (''n''&nbsp;+&nbsp;1) trials will all be successes is exactly 1/2, whatever the value of&nbsp;''n''. Based on the Haldane Beta(0,0) distribution as the prior probability, this posterior probability is 1 (absolute certainty that after ''n'' successes in ''n'' trials the next (''n''&nbsp;+&nbsp;1) trials will all be successes). Perks<ref name=Perks/> (p.&nbsp;303) shows that, for what is now known as the Jeffreys prior, this probability is ((''n''&nbsp;+&nbsp;1/2)/(''n''&nbsp;+&nbsp;1))((''n''&nbsp;+&nbsp;3/2)/(''n''&nbsp;+&nbsp;2))...((2''n''&nbsp;+&nbsp;1/2)/(2''n''&nbsp;+&nbsp;1)), which for ''n''&nbsp;=&nbsp;1,&nbsp;2,&nbsp;3 gives 15/24, 315/480, 9009/13440, rapidly approaching a limiting value of <math>1/\sqrt{2} = 0.70710678\ldots</math> as ''n'' tends to infinity.  Perks remarks that what is now known as the Jeffreys prior: "is clearly more 'reasonable' than either the Bayes–Laplace result or the result on the (Haldane) alternative rule rejected by Jeffreys which gives certainty as the probability. It clearly provides a very much better correspondence with the process of induction. Whether it is 'absolutely' reasonable for the purpose, i.e. whether it is yet large enough, without the absurdity of reaching unity, is a matter for others to decide. But it must be realized that the result depends on the assumption of complete indifference and absence of knowledge prior to the sampling experiment."
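
The run probabilities quoted above can be reproduced with exact fractions. The following sketch is an illustration under stated assumptions: the function name is hypothetical, and it relies on the standard fact that for a Beta(''a'',''b'') posterior the probability of ''m'' further successes in a row is the product of (''a''&nbsp;+&nbsp;''k'')/(''a''&nbsp;+&nbsp;''b''&nbsp;+&nbsp;''k'') for ''k''&nbsp;=&nbsp;0,&nbsp;…,&nbsp;''m''&nbsp;−&nbsp;1. After ''n'' successes in ''n'' trials the posterior is Beta(''n''&nbsp;+&nbsp;''α''&nbsp;prior, ''β''&nbsp;prior).

```python
import math
from fractions import Fraction

def prob_next_run_all_successes(n, a_prior, b_prior):
    """P(next n + 1 trials all succeed | n successes in n trials), as an exact fraction."""
    a = Fraction(n) + Fraction(a_prior)   # posterior alpha after n successes
    b = Fraction(b_prior)                 # posterior beta (no failures observed)
    prob = Fraction(1)
    for k in range(n + 1):
        prob *= (a + k) / (a + b + k)
    return prob

half = Fraction(1, 2)
for n in (1, 2, 3):
    print(n,
          prob_next_run_all_successes(n, 1, 1),        # Bayes prior
          prob_next_run_all_successes(n, half, half),  # Jeffreys prior
          prob_next_run_all_successes(n, 0, 0))        # Haldane prior

# Approach to the limiting value 1/sqrt(2) under the Jeffreys prior.
print(float(prob_next_run_all_successes(200, half, half)), 1 / math.sqrt(2))
```

Python's <code>Fraction</code> reduces results to lowest terms, so the Jeffreys values print as 5/8, 21/32 and 429/640, which equal the unreduced 15/24, 315/480 and 9009/13440 given in the text.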

Following are the variances of the posterior distribution obtained with these three prior probability distributions:

for the '''Bayes'''' prior probability (Beta(1,1)), the posterior variance is:

:<math>\text{variance} = \frac{(n-s+1)(s+1)}{(3+n)(2+n)^2},\text{ which for  } s=\frac{n}{2} \text{ results in variance} =\frac{1}{12+4n}</math>

for the '''Jeffreys'''' prior probability (Beta(1/2,1/2)), the posterior variance is:

:<math>\text{variance} = \frac{(n-s+\frac{1}{2})(s+\frac{1}{2})}{(2+n)(1+n)^2} ,\text{ which for } s=\frac{n}{2} \text{ results in variance} = \frac{1}{8 + 4n}</math>

and for the '''Haldane''' prior probability (Beta(0,0)), the posterior variance is:

:<math>\text{variance} = \frac{(n-s)s}{(1+n)n^2}, \text{ which for  }s=\frac{n}{2}\text{ results in variance} =\frac{1}{4+4n}</math>

So, as remarked by Silvey,<ref name=Silvey/> for large ''n'', the variance is small and hence the posterior distribution is highly concentrated, whereas the assumed prior distribution was very diffuse.  This is in accord with what one would hope for, as vague prior knowledge is transformed (through Bayes' theorem) into more precise posterior knowledge by an informative experiment.  For small ''n'' the Haldane Beta(0,0) prior results in the largest posterior variance while the Bayes Beta(1,1) prior results in the most concentrated posterior.  The Jeffreys prior Beta(1/2,1/2) results in a posterior variance in between the other two.  As ''n'' increases, the variance rapidly decreases so that the posterior variance for all three priors converges to approximately the same value (approaching zero variance as ''n'' → ∞). Recalling the previous result that the ''Haldane'' prior probability Beta(0,0) results in a posterior probability density with ''mean'' (the expected value for the probability of success in the "next" trial) identical to the ratio ''s''/''n'' of the number of successes to the total number of trials, it follows from the above expression that the ''Haldane'' prior Beta(0,0) also results in a posterior with ''variance'' identical to the variance expressed in terms of the maximum likelihood estimate ''s''/''n'' and the sample size (in {{section link||Variance}}):

:<math>\text{variance} = \frac{\mu(1-\mu)}{1 + \nu}= \frac{(n-s)s}{(1+n) n^2} </math>

with the mean ''μ''&nbsp;=&nbsp;''s''/''n'' and the sample size&nbsp;''ν''&nbsp;=&nbsp;''n''.
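
The three variance expressions and their ''s''&nbsp;=&nbsp;''n''/2 simplifications can be checked with exact arithmetic. The sketch below is illustrative only: the function name and the sample values ''n''&nbsp;=&nbsp;10 with ''s''&nbsp;=&nbsp;5 or ''s''&nbsp;=&nbsp;3 are assumptions for the example. It applies the general Beta(''a'',''b'') variance ''ab''/((''a''&nbsp;+&nbsp;''b'')<sup>2</sup>(''a''&nbsp;+&nbsp;''b''&nbsp;+&nbsp;1)) to the posterior parameters.

```python
from fractions import Fraction

def posterior_variance(s, n, a_prior, b_prior):
    """Variance of the posterior Beta(s + a_prior, n - s + b_prior)."""
    a = Fraction(s) + Fraction(a_prior)
    b = Fraction(n - s) + Fraction(b_prior)
    return a * b / ((a + b) ** 2 * (a + b + 1))

half = Fraction(1, 2)
n = 10  # hypothetical sample size

# Symmetric case s = n/2: expect 1/(12+4n), 1/(8+4n), 1/(4+4n).
print(posterior_variance(5, n, 1, 1))        # Bayes
print(posterior_variance(5, n, half, half))  # Jeffreys
print(posterior_variance(5, n, 0, 0))        # Haldane

# Haldane posterior variance written as mu(1 - mu)/(1 + nu), mu = s/n, nu = n.
mu = Fraction(3, n)
print(posterior_variance(3, n, 0, 0), mu * (1 - mu) / (1 + n))
```

For ''n''&nbsp;=&nbsp;10 the symmetric-case variances are 1/52, 1/48 and 1/44, so the Haldane posterior is the most diffuse and the Bayes posterior the most concentrated, as stated above.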

In Bayesian inference, using a Beta(''α''Prior,''β''Prior) [[prior distribution]] for the parameter of a binomial distribution is equivalent to adding (''α''Prior&nbsp;−&nbsp;1) pseudo-observations of "success" and (''β''Prior&nbsp;−&nbsp;1) pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter ''p'' of the binomial distribution by the proportion of successes over both real and pseudo-observations.  A uniform prior Beta(1,1) does not add (or subtract) any pseudo-observations since for Beta(1,1) it follows that (''α''Prior&nbsp;−&nbsp;1)&nbsp;=&nbsp;0 and (''β''Prior&nbsp;−&nbsp;1)&nbsp;=&nbsp;0. The Haldane prior Beta(0,0) subtracts one pseudo-observation from each, and the Jeffreys prior Beta(1/2,1/2) subtracts half a pseudo-observation of success and half a pseudo-observation of failure. This subtraction has the effect of [[smoothing]] out the posterior distribution.  If the proportion of successes is not 50% (''s''/''n''&nbsp;≠&nbsp;1/2), values of ''α''Prior and ''β''Prior less than&nbsp;1 (and therefore negative (''α''Prior&nbsp;−&nbsp;1) and (''β''Prior&nbsp;−&nbsp;1)) favor sparsity, i.e. distributions where the parameter ''p'' is closer to either 0 or&nbsp;1.  In effect, values of ''α''Prior and ''β''Prior between 0 and 1, when operating together, function as a [[concentration parameter]].
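
As a worked restatement of the pseudo-observation reading (an illustration, not taken from the cited sources): the posterior is Beta(''s''&nbsp;+&nbsp;''α''Prior, ''n''&nbsp;−&nbsp;''s''&nbsp;+&nbsp;''β''Prior), so whenever both of its parameters exceed 1 its mode is

:<math>\text{mode} = \frac{s + (\alpha \operatorname{prior} - 1)}{n + (\alpha \operatorname{prior} - 1) + (\beta \operatorname{prior} - 1)},</math>

that is, the proportion of successes among the real and pseudo-observations combined. Substituting (''α''Prior,&nbsp;''β''Prior)&nbsp;=&nbsp;(1,1), (1/2,1/2) and (0,0) gives ''s''/''n'', (''s''&nbsp;−&nbsp;1/2)/(''n''&nbsp;−&nbsp;1) and (''s''&nbsp;−&nbsp;1)/(''n''&nbsp;−&nbsp;2), the posterior modes for the Bayes, Jeffreys and Haldane priors respectively.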

The accompanying plots show the posterior probability density functions for sample sizes ''n''&nbsp;&isin;&nbsp;{3,10,50}, successes ''s''&nbsp;&isin;&nbsp;{''n''/2,''n''/4} and Beta(''α''Prior,''β''Prior)&nbsp;&isin;&nbsp;{Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. Also shown are the cases for ''n''&nbsp;&isin;&nbsp;{4,12,40}, successes ''s''&nbsp;=&nbsp;''n''/4 and Beta(''α''Prior,''β''Prior)&nbsp;&isin;&nbsp;{Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. The first plot shows the symmetric cases, for successes ''s''&nbsp;=&nbsp;''n''/2, with mean&nbsp;=&nbsp;mode&nbsp;=&nbsp;1/2, and the second plot shows the skewed cases ''s''&nbsp;=&nbsp;''n''/4.  The images show that there is little difference between the priors for the posterior with a sample size of 50 (characterized by a more pronounced peak near ''p''&nbsp;=&nbsp;1/2). Significant differences appear for very small sample sizes (in particular for the flatter distribution for the degenerate case of sample size&nbsp;=&nbsp;3). The skewed cases, with successes ''s''&nbsp;=&nbsp;''n''/4, show a larger effect from the choice of prior, at small sample size, than the symmetric cases.  For symmetric distributions, the Bayes prior Beta(1,1) results in the most "peaky" and highest posterior distributions and the Haldane prior Beta(0,0) results in the flattest and lowest peak distribution.  The Jeffreys prior Beta(1/2,1/2) lies in between them.  For nearly symmetric, not too skewed distributions the effect of the priors is similar.  For very small sample size (in this case for a sample size of 3) and skewed distribution (in this example for ''s''&nbsp;=&nbsp;''n''/4) the Haldane prior can result in a reverse-J-shaped distribution with a singularity at the left end.
However, this happens only in degenerate cases (in this example ''n''&nbsp;=&nbsp;3 and hence ''s''&nbsp;=&nbsp;3/4&nbsp;<&nbsp;1, a degenerate value because ''s'' should be greater than unity in order for the posterior of the Haldane prior to have a mode located between the ends, and because ''s''&nbsp;=&nbsp;3/4 is not an integer, hence it violates the initial assumption of a binomial distribution for the likelihood) and it is not an issue in generic cases of reasonable sample size (such that the condition 1&nbsp;<&nbsp;''s''&nbsp;<&nbsp;''n''&nbsp;−&nbsp;1, necessary for a mode to exist between both ends, is fulfilled).
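
The peak-height comparison for the symmetric cases can be reproduced without plotting. The sketch below is an illustration only: the function names and the relative-spread measure are assumptions introduced for the example, and the sample sizes are those of the plotted cases (for ''n''&nbsp;=&nbsp;3 the value ''s''&nbsp;=&nbsp;''n''/2 is not an integer, as in the plots). It evaluates each posterior density at ''x''&nbsp;=&nbsp;1/2 for ''s''&nbsp;=&nbsp;''n''/2.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point 0 < x < 1, using log-gamma for stability."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)

# Common value of alpha_prior = beta_prior for each of the three priors.
PRIORS = {"Haldane": 0.0, "Jeffreys": 0.5, "Bayes": 1.0}

def peak_heights(n):
    """Posterior density at x = 1/2 for the symmetric case s = n/2, per prior."""
    s = n / 2
    return {name: beta_pdf(0.5, s + c, n - s + c) for name, c in PRIORS.items()}

def relative_spread(n):
    """Relative gap between the highest (Bayes) and lowest (Haldane) peak."""
    h = peak_heights(n)
    return (h["Bayes"] - h["Haldane"]) / h["Haldane"]

for n in (3, 10, 50):
    print(n, peak_heights(n), relative_spread(n))
```

The relative gap between the Bayes and Haldane peaks shrinks from about one third at ''n''&nbsp;=&nbsp;3 to about 2% at ''n''&nbsp;=&nbsp;50, consistent with the observation that the choice of prior matters little at the larger sample size.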

In Chapter 12 (p.&nbsp;385) of his book, Jaynes<ref name=Jaynes/> asserts that the ''Haldane prior'' Beta(0,0) describes a ''prior state of knowledge of complete ignorance'', where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure, while the ''Bayes (uniform) prior Beta(1,1) applies if'' one knows that ''both binary outcomes are possible''. Jaynes states: "''interpret the Bayes–Laplace (Beta(1,1)) prior as describing not a state of complete ignorance'', but the state of knowledge in which we have observed one success and one failure...once we have seen at least one success and one failure, then we know that the experiment is a true binary one, in the sense of physical possibility." Jaynes<ref name=Jaynes/> does not specifically discuss the Jeffreys prior Beta(1/2,1/2) (Jaynes's discussion of "Jeffreys prior" on pp.&nbsp;181, 423 and in chapter 12 of his book<ref name=Jaynes/> refers instead to the improper, un-normalized prior "1/''p''&nbsp;''dp''" introduced by Jeffreys in the 1939 edition of his book,<ref name=Jeffreys/> seven years before he introduced what is now known as Jeffreys' invariant prior: the square root of the determinant of Fisher's information matrix. ''"1/p" is Jeffreys' (1946) invariant prior for the [[exponential distribution]], not for the Bernoulli or binomial distributions''). However, it follows from the above discussion that the Jeffreys Beta(1/2,1/2) prior represents a state of knowledge in between the Haldane Beta(0,0) and Bayes Beta(1,1) priors.

Similarly, [[Karl Pearson]] in his 1892 book ''[[The Grammar of Science]]''<ref name=PearsonGrammar>{{cite book| last=Pearson|first=Karl|title=The Grammar of Science|year=1892|publisher=Walter Scott, London|url=https://books.google.com/books?id=IvdsEcFwcnsC&q=grammar+of+science&pg=PR19}}</ref><ref name=PearsnGrammar2009>{{cite book|last=Pearson|first=Karl|title=The Grammar of Science|year=2009|publisher=BiblioLife|isbn=978-1110356119}}</ref> (p.&nbsp;144 of 1900 edition) maintained that the Bayes (Beta(1,1)) uniform prior was not a complete ignorance prior, and that it should be used when prior information justified the choice to "distribute our ignorance equally".  K. Pearson wrote: "Yet the only supposition that we appear to have made is this: that, knowing nothing of nature, routine and anomy (from the Greek ανομία, namely: a- "without", and nomos "law") are to be considered as equally likely to occur.  Now we were not really justified in making even this assumption, for it involves a knowledge that we do not possess regarding nature.  We use our ''experience'' of the constitution and action of coins in general to assert that heads and tails are equally probable, but we have no right to assert before experience that, as we know nothing of nature, routine and breach are equally probable. In our ignorance we ought to consider before experience that nature may consist of all routines, all anomies (normlessness), or a mixture of the two in any proportion whatever, and that all such are equally probable. Which of these constitutions after experience is the most probable must clearly depend on what that experience has been like."

If there is sufficient [[Sample (statistics)|sampling data]], ''and the posterior probability mode is not located at one of the extremes of the domain'' (''x''&nbsp;=&nbsp;0 or ''x''&nbsp;=&nbsp;1), the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar [[posterior probability|''posterior'' probability]] densities.  Otherwise, as Gelman et al.<ref name=Gelman>{{cite book|last=Gelman|first=A., Carlin, J. B., Stern, H. S., and Rubin, D. B.|title=Bayesian Data Analysis| year=2003|publisher=Chapman and Hall/CRC|isbn=978-1584883883}}</ref> (p.&nbsp;65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution", or as Berger<ref name=BergerDecisionTheory/> (p.&nbsp;125) points out, "when different reasonable priors yield substantially different answers, can it be right to state that there ''is'' a single answer? Would it not be better to admit that there is scientific uncertainty, with the conclusion depending on prior beliefs?"