Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Beta distribution
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Statistical inference == ===Parameter estimation=== ====Method of moments==== =====Two unknown parameters===== Two unknown parameters (<math> (\hat{\alpha}, \hat{\beta})</math> of a beta distribution supported in the [0,1] interval) can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let: : <math>\text{sample mean(X)}=\bar{x} = \frac{1}{N}\sum_{i=1}^N X_i</math> be the [[sample mean]] estimate and : <math>\text{sample variance(X)} =\bar{v} = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar{x})^2</math> be the [[sample variance]] estimate. The [[method of moments (statistics)|method-of-moments]] estimates of the parameters are :<math>\hat{\alpha} = \bar{x} \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right),</math> if <math>\bar{v} <\bar{x}(1 - \bar{x}),</math> : <math>\hat{\beta} = (1-\bar{x}) \left(\frac{\bar{x} (1 - \bar{x})}{\bar{v}} - 1 \right),</math> if <math>\bar{v}<\bar{x}(1 - \bar{x}).</math> <!-- MLE's should be in this section too. Maybe I'll be back.... --> When the distribution is required over a known interval other than [0, 1] with random variable ''X'', say [''a'', ''c''] with random variable ''Y'', then replace <math>\bar{x}</math> with <math>\frac{\bar{y}-a}{c-a},</math> and <math>\bar{v}</math> with <math>\frac{\bar{v_Y}}{(c-a)^2}</math> in the above couple of equations for the shape parameters (see the "Four unknown parameters" section below),<ref>{{Cite web|url=https://www.itl.nist.gov/div898/handbook/eda/section3/eda366h.htm|title=1.3.6.6.17. Beta Distribution|website=www.itl.nist.gov}}</ref> where: : <math>\text{sample mean(Y)}=\bar{y} = \frac{1}{N}\sum_{i=1}^N Y_i</math> : <math>\text{sample variance(Y)} = \bar{v_Y} = \frac{1}{N-1}\sum_{i=1}^N (Y_i - \bar{y})^2</math> =====Four unknown parameters===== [[File:(alpha and beta) Parameter estimates vs. excess Kurtosis and (squared) Skewness Beta distribution - J. Rodal.png|thumb|Solutions for parameter estimates vs. (sample) excess Kurtosis and (sample) squared Skewness Beta distribution]] All four parameters (<math>\hat{\alpha}, \hat{\beta}, \hat{a}, \hat{c}</math> of a beta distribution supported in the [''a'', ''c''] interval, see section [[Beta distribution#Four parameters|"Alternative parametrizations, Four parameters"]]) can be estimated, using the method of moments developed by [[Karl Pearson]], by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).<ref name=JKB/><ref name=Elderton1906/><ref name="Elderton and Johnson">{{cite book|last=Elderton|first=William Palin and Norman Lloyd Johnson|title=Systems of Frequency Curves|year=2009|publisher=Cambridge University Press|isbn=978-0521093361}}</ref> The excess kurtosis was expressed in terms of the square of the skewness, and the sample size ν = α + β, (see previous section [[Beta distribution#Kurtosis|"Kurtosis"]]) as follows: :<math>\text{excess kurtosis} =\frac{6}{3 + \nu}\left(\frac{(2 + \nu)}{4} (\text{skewness})^2 - 1\right)\text{ if (skewness)}^2-2< \text{excess kurtosis}< \tfrac{3}{2} (\text{skewness})^2</math> One can use this equation to solve for the sample size ν= α + β in terms of the square of the skewness and the excess kurtosis as follows:<ref name=Elderton1906/> :<math>\hat{\nu} = \hat{\alpha} + \hat{\beta} = 3\frac{(\text{sample excess kurtosis}) - (\text{sample skewness})^2+2}{\frac{3}{2} (\text{sample skewness})^2 - \text{(sample excess kurtosis)}}</math> :<math>\text{ if (sample skewness)}^2-2< \text{sample excess kurtosis}< \tfrac{3}{2} (\text{sample skewness})^2</math> This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson<ref name=Pearson />) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see {{section link||Kurtosis bounded by the square of the skewness}}): The case of zero skewness, can be immediately solved because for zero skewness, α = β and hence ν = 2α = 2β, therefore α = β = ν/2 : <math>\hat{\alpha} = \hat{\beta} = \frac{\hat{\nu}}{2}= \frac{\frac{3}{2}(\text{sample excess kurtosis}) +3}{- \text{(sample excess kurtosis)}}</math> : <math> \text{ if sample skewness}= 0 \text{ and } -2<\text{sample excess kurtosis}<0</math> (Excess kurtosis is negative for the beta distribution with zero skewness, ranging from -2 to 0, so that <math>\hat{\nu}</math> -and therefore the sample shape parameters- is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches -2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero). For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters <math>\hat{a}, \hat{c}</math>, the parameters <math>\hat{\alpha}, \hat{\beta}</math> can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters): :<math>(\text{sample skewness})^2 = \frac{4(\hat{\beta}-\hat{\alpha})^2 (1 + \hat{\alpha} + \hat{\beta})}{\hat{\alpha} \hat{\beta} (2 + \hat{\alpha} + \hat{\beta})^2}</math> :<math>\text{sample excess kurtosis} =\frac{6}{3 + \hat{\alpha} + \hat{\beta}}\left(\frac{(2 + \hat{\alpha} + \hat{\beta})}{4} (\text{sample skewness})^2 - 1\right)</math> :<math>\text{ if (sample skewness)}^2-2< \text{sample excess kurtosis}< \tfrac{3}{2}(\text{sample skewness})^2</math> resulting in the following solution:<ref name=Elderton1906/> : <math>\hat{\alpha}, \hat{\beta} = \frac{\hat{\nu}}{2} \left (1 \pm \frac{1}{ \sqrt{1+ \frac{16 (\hat{\nu} + 1)}{(\hat{\nu} + 2)^2(\text{sample skewness})^2}}} \right )</math> : <math>\text{ if sample skewness}\neq 0 \text{ and } (\text{sample skewness})^2-2< \text{sample excess kurtosis}< \tfrac{3}{2} (\text{sample skewness})^2</math> Where one should take the solutions as follows: <math>\hat{\alpha}>\hat{\beta}</math> for (negative) sample skewness < 0, and <math>\hat{\alpha}<\hat{\beta}</math> for (positive) sample skewness > 0. The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness) and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2 and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by "the impossible boundary" line (excess kurtosis + 2 - skewness<sup>2</sup> = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities <math>p=\tfrac{\beta}{\alpha + \beta}</math> at the left end ''x'' = 0 and <math>q = 1-p = \tfrac{\alpha}{\alpha + \beta} </math> at the right end ''x'' = 1. The two surfaces become further apart towards the rear edge. At this rear edge the surface parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,<ref name="BowmanShenton">{{cite journal|last=Bowman|first=K. O.|author1-link=Kimiko O. Bowman|author2=Shenton, L. R.|title=The beta distribution, moment method, Karl Pearson and R.A. Fisher|journal=Far East J. Theo. Stat.|year=2007|volume=23|issue=2|pages=133–164| url=http://www.csm.ornl.gov/~bowman/fjts232.pdf }}</ref> sampling in the neighborhood of the line (sample excess kurtosis - (3/2)(sample skewness)<sup>2</sup> = 0) (the just-J-shaped portion of the rear edge where blue meets beige), "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero and hence ν approaches infinity as that line is approached. Bowman and Shenton <ref name="BowmanShenton" /> write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However, the mean and standard deviation are fairly reliable." Therefore, the problem is for the case of four parameter estimation for very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness. This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See {{section link||Kurtosis bounded by the square of the skewness}} for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis - (3/2)(sample skewness)<sup>2</sup> = 0). As remarked by Karl Pearson himself <ref name=Pearson1936/> this issue may not be of much practical importance as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of shape parameters that are unlikely to occur much in practice). The usual skewed-bell-shape distributions that occur in practice do not have this parameter estimation problem. The remaining two parameters <math>\hat{a}, \hat{c}</math> can be determined using the sample mean and the sample variance using a variety of equations.<ref name="JKB"/><ref name=Elderton1906/> One alternative is to calculate the support interval range <math>(\hat{c}-\hat{a})</math> based on the sample variance and the sample kurtosis. For this purpose one can solve, in terms of the range <math>(\hat{c}- \hat{a})</math>, the equation expressing the excess kurtosis in terms of the sample variance, and the sample size ν (see {{section link||Kurtosis }} and {{section link||Alternative parametrizations, four parameters}}): :<math>\text{sample excess kurtosis} =\frac{6}{(3 + \hat{\nu})(2 + \hat{\nu})}\bigg(\frac{(\hat{c}- \hat{a})^2}{\text{(sample variance)}} - 6 - 5 \hat{\nu} \bigg)</math> to obtain: :<math> (\hat{c}- \hat{a}) = \sqrt{\text{(sample variance)}}\sqrt{6+5\hat{\nu}+\frac{(2+\hat{\nu})(3+\hat{\nu})}{6}\text{(sample excess kurtosis)}}</math> Another alternative is to calculate the support interval range <math>(\hat{c}-\hat{a})</math> based on the sample variance and the sample skewness.<ref name=Elderton1906/> For this purpose one can solve, in terms of the range <math>(\hat{c}-\hat{a})</math>, the equation expressing the squared skewness in terms of the sample variance, and the sample size ν (see section titled "Skewness" and "Alternative parametrizations, four parameters"): :<math>(\text{sample skewness})^2 = \frac{4}{(2+\hat{\nu})^2}\bigg(\frac{(\hat{c}- \hat{a})^2}{ \text{(sample variance)}}-4(1+\hat{\nu})\bigg)</math> to obtain:<ref name=Elderton1906/> :<math> (\hat{c}- \hat{a}) = \frac{\sqrt{\text{(sample variance)}}}{2}\sqrt{(2+\hat{\nu})^2(\text{sample skewness})^2+16(1+\hat{\nu})}</math> The remaining parameter can be determined from the sample mean and the previously obtained parameters: <math>(\hat{c}-\hat{a}), \hat{\alpha}, \hat{\nu} = \hat{\alpha}+\hat{\beta}</math>: :<math> \hat{a} = (\text{sample mean}) - \left(\frac{\hat{\alpha}}{\hat{\nu}}\right)(\hat{c}-\hat{a}) </math> and finally, <math>\hat{c}= (\hat{c}- \hat{a}) + \hat{a} </math>. In the above formulas one may take, for example, as estimates of the sample moments: :<math>\begin{align} \text{sample mean} &=\overline{y} = \frac{1}{N}\sum_{i=1}^N Y_i \\ \text{sample variance} &= \overline{v}_Y = \frac{1}{N-1}\sum_{i=1}^N (Y_i - \overline{y})^2 \\ \text{sample skewness} &= G_1 = \frac{N}{(N-1)(N-2)} \frac{\sum_{i=1}^N (Y_i-\overline{y})^3}{\overline{v}_Y^{\frac{3}{2}} } \\ \text{sample excess kurtosis} &= G_2 = \frac{N(N+1)}{(N-1)(N-2)(N-3)} \frac{\sum_{i=1}^N (Y_i - \overline{y})^4}{\overline{v}_Y^2} - \frac{3(N-1)^2}{(N-2)(N-3)} \end{align}</math> The estimators ''G''<sub>1</sub> for [[skewness|sample skewness]] and ''G''<sub>2</sub> for [[kurtosis|sample kurtosis]] are used by [[DAP (software)|DAP]]/[[SAS System|SAS]], [[PSPP]]/[[SPSS]], and [[Microsoft Excel|Excel]]. However, they are not used by [[BMDP]] and (according to <ref name="Joanes and Gill"/>) they were not used by [[MINITAB]] in 1998. Actually, Joanes and Gill in their 1998 study<ref name="Joanes and Gill">{{cite journal|last=Joanes|first=D. N.|author2=C. A. Gill|title=Comparing measures of sample skewness and kurtosis|journal=The Statistician|year=1998|volume=47|issue=Part 1|pages=183–189|doi=10.1111/1467-9884.00122}}</ref> concluded that the skewness and kurtosis estimators used in [[BMDP]] and in [[MINITAB]] (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in [[DAP (software)|DAP]]/[[SAS System|SAS]], [[PSPP]]/[[SPSS]], namely ''G''<sub>1</sub> and ''G''<sub>2</sub>, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill<ref name="Joanes and Gill"/>). ====Maximum likelihood==== =====Two unknown parameters===== [[File:Max (Joint Log Likelihood per N) for Beta distribution Maxima at alpha=beta=2 - J. Rodal.png|thumb|Max (joint log likelihood/''N'') for beta distribution maxima at ''α'' = ''β'' = 2]] [[File:Max (Joint Log Likelihood per N) for Beta distribution Maxima at alpha=beta= 0.25,0.5,1,2,4,6,8 - J. Rodal.png|thumb|Max (joint log likelihood/''N'') for Beta distribution maxima at ''α'' = ''β'' ∈ {0.25,0.5,1,2,4,6,8}]] As is also the case for [[maximum likelihood]] estimates for the [[gamma distribution]], the maximum likelihood estimates for the beta distribution do not have a general closed form solution for arbitrary values of the shape parameters. If ''X''<sub>1</sub>, ..., ''X<sub>N</sub>'' are independent random variables each having a beta distribution, the joint log likelihood function for ''N'' [[independent and identically distributed random variables|iid]] observations is: :<math>\begin{align} \ln\, \mathcal{L} (\alpha, \beta\mid X) &= \sum_{i=1}^N \ln \left (\mathcal{L}_i (\alpha, \beta\mid X_i) \right )\\ &= \sum_{i=1}^N \ln \left (f(X_i;\alpha,\beta) \right ) \\ &= \sum_{i=1}^N \ln \left (\frac{X_i^{\alpha-1}(1-X_i)^{\beta-1}}{\Beta(\alpha,\beta)} \right ) \\ &= (\alpha - 1)\sum_{i=1}^N \ln (X_i) + (\beta- 1)\sum_{i=1}^N \ln (1-X_i) - N \ln \Beta(\alpha,\beta) \end{align}</math> Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero yielding the [[maximum likelihood]] estimator of the shape parameters: :<math>\frac{\partial \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \alpha} = \sum_{i=1}^N \ln X_i -N\frac{\partial \ln \Beta(\alpha,\beta)}{\partial \alpha}=0</math> :<math>\frac{\partial \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \beta} = \sum_{i=1}^N \ln (1-X_i)- N\frac{\partial \ln \mathrm{B}(\alpha,\beta)}{\partial \beta}=0</math> where: :<math>\frac{\partial \ln \Beta(\alpha,\beta)}{\partial \alpha} = -\frac{\partial \ln \Gamma(\alpha+\beta)}{\partial \alpha}+ \frac{\partial \ln \Gamma(\alpha)}{\partial \alpha}+ \frac{\partial \ln \Gamma(\beta)}{\partial \alpha}=-\psi(\alpha + \beta) + \psi(\alpha) + 0</math> :<math>\frac{\partial \ln \Beta(\alpha,\beta)}{\partial \beta}= - \frac{\partial \ln \Gamma(\alpha+\beta)}{\partial \beta}+ \frac{\partial \ln \Gamma(\alpha)}{\partial \beta} + \frac{\partial \ln \Gamma(\beta)}{\partial \beta}=-\psi(\alpha + \beta) + 0 + \psi(\beta)</math> since the '''[[digamma function]]''' denoted ψ(α) is defined as the [[logarithmic derivative]] of the [[gamma function]]:<ref name=Abramowitz/> :<math>\psi(\alpha) =\frac {\partial\ln \Gamma(\alpha)}{\partial \alpha}</math> To ensure that the values with zero tangent slope are indeed a maximum (instead of a saddle-point or a minimum) one has to also satisfy the condition that the curvature is negative. This amounts to satisfying that the second partial derivative with respect to the shape parameters is negative :<math>\frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \alpha^2}= -N\frac{\partial^2\ln \Beta(\alpha,\beta)}{\partial \alpha^2}<0</math> :<math>\frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \beta^2} = -N\frac{\partial^2\ln \Beta(\alpha,\beta)}{\partial \beta^2}<0</math> using the previous equations, this is equivalent to: :<math>\frac{\partial^2\ln \Beta(\alpha,\beta)}{\partial \alpha^2} = \psi_1(\alpha)-\psi_1(\alpha + \beta) > 0</math> :<math>\frac{\partial^2\ln \Beta(\alpha,\beta)}{\partial \beta^2} = \psi_1(\beta) -\psi_1(\alpha + \beta) > 0</math> where the '''[[trigamma function]]''', denoted ''ψ''<sub>1</sub>(''α''), is the second of the [[polygamma function]]s, and is defined as the derivative of the [[digamma]] function: :<math>\psi_1(\alpha) = \frac{\partial^2\ln\Gamma(\alpha)}{\partial \alpha^2}=\, \frac{\partial\, \psi(\alpha)}{\partial \alpha}.</math> These conditions are equivalent to stating that the variances of the logarithmically transformed variables are positive, since: :<math>\operatorname{var}[\ln (X)] = \operatorname{E}[\ln^2 (X)] - (\operatorname{E}[\ln (X)])^2 = \psi_1(\alpha) - \psi_1(\alpha + \beta) </math> :<math>\operatorname{var}[\ln (1-X)] = \operatorname{E}[\ln^2 (1-X)] - (\operatorname{E}[\ln (1-X)])^2 = \psi_1(\beta) - \psi_1(\alpha + \beta) </math> Therefore, the condition of negative curvature at a maximum is equivalent to the statements: : <math> \operatorname{var}[\ln (X)] > 0</math> : <math> \operatorname{var}[\ln (1-X)] > 0</math> Alternatively, the condition of negative curvature at a maximum is also equivalent to stating that the following [[logarithmic derivative]]s of the [[geometric mean]]s ''G<sub>X</sub>'' and ''G<sub>(1−X)</sub>'' are positive, since: : <math>\psi_1(\alpha) - \psi_1(\alpha + \beta) = \frac{\partial \ln G_X}{\partial \alpha} > 0</math> : <math>\psi_1(\beta) - \psi_1(\alpha + \beta) = \frac{\partial \ln G_{(1-X)}}{\partial \beta} > 0</math> While these slopes are indeed positive, the other slopes are negative: :<math>\frac{\partial\, \ln G_X}{\partial \beta}, \frac{\partial \ln G_{(1-X)}}{\partial \alpha} < 0.</math> The slopes of the mean and the median with respect to ''α'' and ''β'' display similar sign behavior. From the condition that at a maximum, the partial derivative with respect to the shape parameter equals zero, we obtain the following system of coupled [[maximum likelihood estimate]] equations (for the average log-likelihoods) that needs to be inverted to obtain the (unknown) shape parameter estimates <math>\hat{\alpha},\hat{\beta}</math> in terms of the (known) average of logarithms of the samples ''X''<sub>1</sub>, ..., ''X<sub>N</sub>'':<ref name=JKB /> :<math>\begin{align} \hat{\operatorname{E}}[\ln (X)] &= \psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta})=\frac{1}{N}\sum_{i=1}^N \ln X_i = \ln \hat{G}_X \\ \hat{\operatorname{E}}[\ln(1-X)] &= \psi(\hat{\beta}) - \psi(\hat{\alpha} + \hat{\beta})=\frac{1}{N}\sum_{i=1}^N \ln (1-X_i)= \ln \hat{G}_{(1-X)} \end{align}</math> where we recognize <math>\log \hat{G}_X</math> as the logarithm of the sample [[geometric mean]] and <math>\log \hat{G}_{(1-X)}</math> as the logarithm of the sample [[geometric mean]] based on (1 − ''X''), the mirror-image of ''X''. For <math>\hat{\alpha}=\hat{\beta}</math>, it follows that <math>\hat{G}_X=\hat{G}_{(1-X)} </math>. :<math>\begin{align} \hat{G}_X &= \prod_{i=1}^N (X_i)^{1/N} \\ \hat{G}_{(1-X)} &= \prod_{i=1}^N (1-X_i)^{1/N} \end{align}</math> These coupled equations containing [[digamma function]]s of the shape parameter estimates <math>\hat{\alpha},\hat{\beta}</math> must be solved by numerical methods as done, for example, by Beckman et al.<ref>{{cite journal|last=Beckman|first=R. J.|author2=G. L. Tietjen|title=Maximum likelihood estimation for the beta distribution|journal=Journal of Statistical Computation and Simulation|year=1978|volume=7|issue=3–4|pages=253–258|doi=10.1080/00949657808810232}}</ref> Gnanadesikan et al. give numerical solutions for a few cases.<ref>{{cite journal |last=Gnanadesikan |first=R., Pinkham and Hughes|title=Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics |journal=Technometrics |year=1967|volume=9|issue=4|pages=607–620 |doi=10.2307/1266199|jstor=1266199}}</ref> [[Norman Lloyd Johnson|N.L.Johnson]] and [[Samuel Kotz|S.Kotz]]<ref name=JKB /> suggest that for "not too small" shape parameter estimates <math>\hat{\alpha},\hat{\beta}</math>, the logarithmic approximation to the digamma function <math>\psi(\hat{\alpha}) \approx \ln(\hat{\alpha}-\tfrac{1}{2})</math> may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly: :<math>\ln \frac{\hat{\alpha} - \frac{1}{2}}{\hat{\alpha} + \hat{\beta} - \frac{1}{2}} \approx \ln \hat{G}_X </math> :<math>\ln \frac{\hat{\beta} - \frac{1}{2}}{\hat{\alpha} + \hat{\beta} - \frac{1}{2}}\approx \ln \hat{G}_{(1-X)} </math> which leads to the following solution for the initial values (of the estimate shape parameters in terms of the sample geometric means) for an iterative solution: :<math>\hat{\alpha}\approx \tfrac{1}{2} + \frac{\hat{G}_{X}}{2(1-\hat{G}_X-\hat{G}_{(1-X)})} \text{ if } \hat{\alpha} >1</math> :<math>\hat{\beta}\approx \tfrac{1}{2} + \frac{\hat{G}_{(1-X)}}{2(1-\hat{G}_X-\hat{G}_{(1-X)})} \text{ if } \hat{\beta} > 1</math> Alternatively, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions. When the distribution is required over a known interval other than [0, 1] with random variable ''X'', say [''a'', ''c''] with random variable ''Y'', then replace ln(''X<sub>i</sub>'') in the first equation with :<math>\ln \frac{Y_i-a}{c-a}</math>, and replace ln(1−''X<sub>i</sub>'') in the second equation with :<math>\ln \frac{c-Y_i}{c-a}</math> (see "Alternative parametrizations, four parameters" section below). If one of the shape parameters is known, the problem is considerably simplified. The following [[logit]] transformation can be used to solve for the unknown shape parameter (for skewed cases such that <math>\hat{\alpha}\neq\hat{\beta}</math>, otherwise, if symmetric, both -equal- parameters are known when one is known): :<math>\hat{\operatorname{E}} \left[\ln \left(\frac{X}{1-X} \right) \right]=\psi(\hat{\alpha}) - \psi(\hat{\beta})=\frac{1}{N}\sum_{i=1}^N \ln\frac{X_i}{1-X_i} = \ln \hat{G}_X - \ln \left(\hat{G}_{(1-X)}\right) </math> This [[logit]] transformation is the logarithm of the transformation that divides the variable ''X'' by its mirror-image (''X''/(1 - ''X'') resulting in the "inverted beta distribution" or [[beta prime distribution]] (also known as beta distribution of the second kind or [[Pearson distribution|Pearson's Type VI]]) with support [0, +∞). As previously discussed in the section "Moments of logarithmically transformed random variables," the [[logit]] transformation <math>\ln\frac{X}{1-X}</math>, studied by Johnson,<ref name=JohnsonLogInv/> extends the finite support [0, 1] based on the original variable ''X'' to infinite support in both directions of the real line (−∞, +∞). If, for example, <math>\hat{\beta}</math> is known, the unknown parameter <math>\hat{\alpha}</math> can be obtained in terms of the inverse<ref name=invpsi.m>{{cite web|last=Fackler |first=Paul|title=Inverse Digamma Function (Matlab)|url=http://hips.seas.harvard.edu/content/inverse-digamma-function-matlab|publisher=Harvard University School of Engineering and Applied Sciences|access-date=2012-08-18}}</ref> digamma function of the right hand side of this equation: :<math>\psi(\hat{\alpha})=\frac{1}{N}\sum_{i=1}^N \ln\frac{X_i}{1-X_i} + \psi(\hat{\beta}) </math> :<math>\hat{\alpha}=\psi^{-1}(\ln \hat{G}_X - \ln \hat{G}_{(1-X)} + \psi(\hat{\beta})) </math> In particular, if one of the shape parameters has a value of unity, for example for <math>\hat{\beta} = 1</math> (the power function distribution with bounded support [0,1]), using the identity ψ(''x'' + 1) = ψ(''x'') + 1/''x'' in the equation <math>\psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta})= \ln \hat{G}_X</math>, the maximum likelihood estimator for the unknown parameter <math>\hat{\alpha}</math> is,<ref name=JKB /> exactly: :<math>\hat{\alpha}= - \frac{1}{\frac{1}{N}\sum_{i=1}^N \ln X_i}= - \frac{1}{ \ln \hat{G}_X} </math> The beta has support [0, 1], therefore <math>\hat{G}_X < 1</math>, and hence <math>(-\ln \hat{G}_X) >0</math>, and therefore <math>\hat{\alpha} >0.</math> In conclusion, the maximum likelihood estimates of the shape parameters of a beta distribution are (in general) a complicated function of the sample [[geometric mean]], and of the sample [[geometric mean]] based on ''(1−X)'', the mirror-image of ''X''. One may ask, if the variance (in addition to the mean) is necessary to estimate two shape parameters with the method of moments, why is the (logarithmic or geometric) variance not necessary to estimate two shape parameters with the maximum likelihood method, for which only the geometric means suffice? The answer is because the mean does not provide as much information as the geometric mean. For a beta distribution with equal shape parameters ''α'' = ''β'', the mean is exactly 1/2, regardless of the value of the shape parameters, and therefore regardless of the value of the statistical dispersion (the variance). On the other hand, the geometric mean of a beta distribution with equal shape parameters ''α'' = ''β'', depends on the value of the shape parameters, and therefore it contains more information. Also, the geometric mean of a beta distribution does not satisfy the symmetry conditions satisfied by the mean, therefore, by employing both the geometric mean based on ''X'' and geometric mean based on (1 − ''X''), the maximum likelihood method is able to provide best estimates for both parameters ''α'' = ''β'', without need of employing the variance. One can express the joint log likelihood per ''N'' [[independent and identically distributed random variables|iid]] observations in terms of the ''[[sufficient statistic]]s'' (the sample geometric means) as follows: :<math>\frac{\ln \mathcal{L} (\alpha, \beta\mid X)}{N} = (\alpha - 1)\ln \hat{G}_X + (\beta- 1)\ln \hat{G}_{(1-X)}- \ln \Beta(\alpha,\beta).</math> We can plot the joint log likelihood per ''N'' observations for fixed values of the sample geometric means to see the behavior of the likelihood function as a function of the shape parameters α and β. In such a plot, the shape parameter estimators <math>\hat{\alpha},\hat{\beta}</math> correspond to the maxima of the likelihood function. See the accompanying graph that shows that all the likelihood functions intersect at α = β = 1, which corresponds to the values of the shape parameters that give the maximum entropy (the maximum entropy occurs for shape parameters equal to unity: the uniform distribution). It is evident from the plot that the likelihood function gives sharp peaks for values of the shape parameter estimators close to zero, but that for values of the shape parameters estimators greater than one, the likelihood function becomes quite flat, with less defined peaks. Obviously, the maximum likelihood parameter estimation method for the beta distribution becomes less acceptable for larger values of the shape parameter estimators, as the uncertainty in the peak definition increases with the value of the shape parameter estimators. One can arrive at the same conclusion by noticing that the expression for the curvature of the likelihood function is in terms of the geometric variances :<math>\frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \alpha^2}= -\operatorname{var}[\ln X]</math> :<math>\frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{\partial \beta^2} = -\operatorname{var}[\ln (1-X)]</math> These variances (and therefore the curvatures) are much larger for small values of the shape parameter α and β. However, for shape parameter values α, β > 1, the variances (and therefore the curvatures) flatten out. Equivalently, this result follows from the [[Cramér–Rao bound]], since the [[Fisher information]] matrix components for the beta distribution are these logarithmic variances. The [[Cramér–Rao bound]] states that the [[variance]] of any ''unbiased'' estimator <math>\hat{\alpha}</math> of α is bounded by the [[multiplicative inverse|reciprocal]] of the [[Fisher information]]: :<math>\mathrm{var}(\hat{\alpha})\geq\frac{1}{\operatorname{var}[\ln X]}\geq\frac{1}{\psi_1(\hat{\alpha}) - \psi_1(\hat{\alpha} + \hat{\beta})}</math> :<math>\mathrm{var}(\hat{\beta}) \geq\frac{1}{\operatorname{var}[\ln (1-X)]}\geq\frac{1}{\psi_1(\hat{\beta}) - \psi_1(\hat{\alpha} + \hat{\beta})}</math> so the variance of the estimators increases with increasing α and β, as the logarithmic variances decrease. Also one can express the joint log likelihood per ''N'' [[independent and identically distributed random variables|iid]] observations in terms of the [[digamma function]] expressions for the logarithms of the sample geometric means as follows: :<math>\frac{\ln\, \mathcal{L} (\alpha, \beta\mid X)}{N} = (\alpha - 1)(\psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta}))+(\beta- 1)(\psi(\hat{\beta}) - \psi(\hat{\alpha} + \hat{\beta}))- \ln \Beta(\alpha,\beta)</math> this expression is identical to the negative of the cross-entropy (see section on "Quantities of information (entropy)"). Therefore, finding the maximum of the joint log likelihood of the shape parameters, per ''N'' [[independent and identically distributed random variables|iid]] observations, is identical to finding the minimum of the cross-entropy for the beta distribution, as a function of the shape parameters. :<math>\frac{\ln\, \mathcal{L} (\alpha, \beta\mid X)}{N} = - H = -h - D_{\mathrm{KL}} = -\ln\Beta(\alpha,\beta)+(\alpha-1)\psi(\hat{\alpha})+(\beta-1)\psi(\hat{\beta})-(\alpha+\beta-2)\psi(\hat{\alpha}+\hat{\beta})</math> with the cross-entropy defined as follows: :<math>H = \int_{0}^1 - f(X;\hat{\alpha},\hat{\beta}) \ln (f(X;\alpha,\beta)) \, {\rm d}X </math> =====Four unknown parameters===== The procedure is similar to the one followed in the two unknown parameter case. If ''Y''<sub>1</sub>, ..., ''Y<sub>N</sub>'' are independent random variables each having a beta distribution with four parameters, the joint log likelihood function for ''N'' [[independent and identically distributed random variables|iid]] observations is: :<math>\begin{align} \ln\, \mathcal{L} (\alpha, \beta, a, c\mid Y) &= \sum_{i=1}^N \ln\,\mathcal{L}_i (\alpha, \beta, a, c\mid Y_i)\\ &= \sum_{i=1}^N \ln\,f(Y_i; \alpha, \beta, a, c) \\ &= \sum_{i=1}^N \ln\,\frac{(Y_i-a)^{\alpha-1} (c-Y_i)^{\beta-1} }{(c-a)^{\alpha+\beta-1}\Beta(\alpha, \beta)}\\ &= (\alpha - 1)\sum_{i=1}^N \ln (Y_i - a) + (\beta- 1)\sum_{i=1}^N \ln (c - Y_i)- N \ln \Beta(\alpha,\beta) - N (\alpha+\beta - 1) \ln (c - a) \end{align}</math> Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero yielding the [[maximum likelihood]] estimator of the shape parameters: :<math>\frac{\partial \ln \mathcal{L} (\alpha, \beta, a, c\mid Y) }{\partial \alpha}= \sum_{i=1}^N \ln (Y_i - a) - N(-\psi(\alpha + \beta) + \psi(\alpha))- N \ln (c - a)= 0</math> :<math>\frac{\partial \ln \mathcal{L} (\alpha, \beta, a, c\mid Y) }{\partial \beta} = \sum_{i=1}^N \ln (c - Y_i) - N(-\psi(\alpha + \beta) + \psi(\beta))- N \ln (c - a)= 0</math> :<math>\frac{\partial \ln \mathcal{L} (\alpha, \beta, a, c\mid Y) }{\partial a} = -(\alpha - 1) \sum_{i=1}^N \frac{1}{Y_i - a} \,+ N (\alpha+\beta - 1)\frac{1}{c - a}= 0</math> :<math>\frac{\partial \ln \mathcal{L} (\alpha, \beta, a, c\mid Y) }{\partial c} = (\beta- 1) \sum_{i=1}^N \frac{1}{c - Y_i} \,- N (\alpha+\beta - 1) \frac{1}{c - a} = 0</math> these equations can be re-arranged as the following system of four coupled equations (the first two equations are geometric means and the second two equations are the harmonic means) in terms of the maximum likelihood estimates for the four parameters <math>\hat{\alpha}, \hat{\beta}, \hat{a}, \hat{c}</math>: :<math>\frac{1}{N}\sum_{i=1}^N \ln \frac{Y_i - \hat{a}}{\hat{c}-\hat{a}} = \psi(\hat{\alpha})-\psi(\hat{\alpha} +\hat{\beta} )= \ln \hat{G}_X</math> :<math>\frac{1}{N}\sum_{i=1}^N \ln \frac{\hat{c} - Y_i}{\hat{c}-\hat{a}} = \psi(\hat{\beta})-\psi(\hat{\alpha} + \hat{\beta})= \ln \hat{G}_{1-X}</math> :<math>\frac{1}{\frac{1}{N}\sum_{i=1}^N \frac{\hat{c} - \hat{a}}{Y_i - \hat{a}}} = \frac{\hat{\alpha} - 1}{\hat{\alpha}+\hat{\beta} - 1}= \hat{H}_X</math> :<math>\frac{1}{\frac{1}{N}\sum_{i=1}^N \frac{\hat{c} - \hat{a}}{\hat{c} - Y_i}} = \frac{\hat{\beta}- 1}{\hat{\alpha}+\hat{\beta} - 1} = \hat{H}_{1-X}</math> with sample geometric means: :<math>\hat{G}_X = \prod_{i=1}^{N} \left (\frac{Y_i - \hat{a}}{\hat{c}-\hat{a}} \right )^{\frac{1}{N}}</math> :<math>\hat{G}_{(1-X)} = \prod_{i=1}^{N} \left (\frac{\hat{c} - Y_i}{\hat{c}-\hat{a}} \right )^{\frac{1}{N}}</math> The parameters <math>\hat{a}, \hat{c}</math> are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/''N''). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for <math>\hat{\alpha}, \hat{\beta} > 1</math>, which precludes a maximum likelihood solution for shape parameters less than unity in the four-parameter case. Fisher's information matrix for the four parameter case is [[Positive-definite matrix|positive-definite]] only for α, β > 2 (for further discussion, see section on Fisher information matrix, four parameter case), for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. The following Fisher information components (that represent the expectations of the curvature of the log likelihood function) have [[mathematical singularity|singularities]] at the following values: :<math>\alpha = 2: \quad \operatorname{E} \left [- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial a^2} \right ]= {\mathcal{I}}_{a, a}</math> :<math>\beta = 2: \quad \operatorname{E}\left [- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial c^2} \right ] = {\mathcal{I}}_{c, c}</math> :<math>\alpha = 2: \quad \operatorname{E}\left [- \frac{1}{N}\frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha \partial a}\right ] = {\mathcal{I}}_{\alpha, a} </math> :<math>\beta = 1: \quad \operatorname{E}\left [- \frac{1}{N}\frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \beta \partial c} \right ] = {\mathcal{I}}_{\beta, c} </math> (for further discussion see section on Fisher information matrix). Thus, it is not possible to strictly carry on the maximum likelihood estimation for some well known distributions belonging to the four-parameter beta distribution family, like the [[continuous uniform distribution|uniform distribution]] (Beta(1, 1, ''a'', ''c'')), and the [[arcsine distribution]] (Beta(1/2, 1/2, ''a'', ''c'')). [[Norman Lloyd Johnson|N.L.Johnson]] and [[Samuel Kotz|S.Kotz]]<ref name=JKB /> ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of ''a'', ''c'', α and β are required, the above procedure (for the two unknown parameter case, with ''X'' transformed as ''X'' = (''Y'' − ''a'')/(''c'' − ''a'')) can be repeated using a succession of trial values of ''a'' and ''c'', until the pair (''a'', ''c'') for which maximum likelihood (given ''a'' and ''c'') is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation). ====Fisher information matrix==== Let a random variable X have a probability density ''f''(''x'';''α''). The partial derivative with respect to the (unknown, and to be estimated) parameter α of the log [[likelihood function]] is called the [[score (statistics)|score]]. The second moment of the score is called the [[Fisher information]]: :<math>\mathcal{I}(\alpha)=\operatorname{E} \left [\left (\frac{\partial}{\partial\alpha} \ln \mathcal{L}(\alpha\mid X) \right )^2 \right],</math> The [[expected value|expectation]] of the [[score (statistics)|score]] is zero, therefore the Fisher information is also the second moment centered on the mean of the score: the [[variance]] of the score. If the log [[likelihood function]] is twice differentiable with respect to the parameter α, and under certain regularity conditions,<ref name=Silvey>{{cite book|last=Silvey|first=S.D.|title=Statistical Inference|year=1975|publisher=Chapman and Hal|page=40|isbn=978-0412138201}}</ref> then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes): :<math>\mathcal{I}(\alpha) = - \operatorname{E} \left [\frac{\partial^2}{\partial\alpha^2} \ln (\mathcal{L}(\alpha\mid X)) \right].</math> Thus, the Fisher information is the negative of the expectation of the second [[derivative]] with respect to the parameter α of the log [[likelihood function]]. Therefore, Fisher information is a measure of the [[curvature]] of the log likelihood function of α. A low [[curvature]] (and therefore high [[Radius of curvature (mathematics)|radius of curvature]]), flatter log likelihood function curve has low Fisher information; while a log likelihood function curve with large [[curvature]] (and therefore low [[Radius of curvature (mathematics)|radius of curvature]]) has high Fisher information. When the Fisher information matrix is computed at the evaluates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a Taylor's series approximation, taken as far as the quadratic terms.<ref name=EdwardsLikelihood>{{cite book|last=Edwards|first=A. W. F.|title=Likelihood|year=1992 |publisher=The Johns Hopkins University Press|isbn=978-0801844430}}</ref> The word information, in the context of Fisher information, refers to information about the parameters. Information such as: estimation, sufficiency and properties of variances of estimators. The [[Cramér–Rao bound]] states that the inverse of the Fisher information is a lower bound on the variance of any [[estimator]] of a parameter α: :<math>\operatorname{var}[\hat\alpha] \geq \frac{1}{\mathcal{I}(\alpha)}.</math> The precision to which one can estimate the estimator of a parameter α is limited by the Fisher Information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypothesis of a parameter.<ref name=Jaynes>{{cite book|last=Jaynes|first=E.T.|title=Probability theory, the logic of science|year=2003|publisher=Cambridge University Press|isbn=978-0521592710}}</ref> When there are ''N'' parameters :<math> \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_N \end{bmatrix},</math> then the Fisher information takes the form of an ''N''×''N'' [[positive semidefinite matrix|positive semidefinite]] [[symmetric matrix]], the Fisher information matrix, with typical element: :<math> (\mathcal{I}(\theta))_{i, j} = \operatorname{E} \left [\left (\frac{\partial}{\partial\theta_i} \ln \mathcal{L} \right) \left(\frac \partial {\partial\theta_j} \ln \mathcal{L} \right) \right ].</math> Under certain regularity conditions,<ref name=Silvey/> the Fisher Information Matrix may also be written in the following form, which is often more convenient for computation: :<math> (\mathcal{I}(\theta))_{i, j} = - \operatorname{E} \left [\frac{\partial^2}{\partial\theta_i \, \partial\theta_j} \ln (\mathcal{L}) \right ]\,.</math> With ''X''<sub>1</sub>, ..., ''X<sub>N</sub>'' [[iid]] random variables, an ''N''-dimensional "box" can be constructed with sides ''X''<sub>1</sub>, ..., ''X<sub>N</sub>''. Costa and Cover<ref name=CostaCover>{{cite book|last=Costa|first=Max, and Cover, Thomas|title=On the similarity of the entropy power inequality and the Brunn Minkowski inequality|date=September 1983|publisher=Tech.Report 48, Dept. Statistics, Stanford University|url=https://isl.stanford.edu/people/cover/papers/transIT/0837cost.pdf}}</ref> show that the (Shannon) differential entropy ''h''(''X'') is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set. =====Two parameters===== For ''X''<sub>1</sub>, ..., ''X''<sub>''N''</sub> independent random variables each having a beta distribution parametrized with shape parameters ''α'' and ''β'', the joint log likelihood function for ''N'' [[independent and identically distributed random variables|iid]] observations is: : <math>\ln (\mathcal{L} (\alpha, \beta\mid X) )= (\alpha - 1)\sum_{i=1}^N \ln X_i + (\beta- 1)\sum_{i=1}^N \ln (1-X_i)- N \ln \Beta(\alpha,\beta) </math> therefore the joint log likelihood function per ''N'' [[independent and identically distributed random variables|iid]] observations is :<math>\frac{1}{N} \ln(\mathcal{L} (\alpha, \beta\mid X)) = (\alpha - 1)\frac{1}{N}\sum_{i=1}^N \ln X_i + (\beta- 1) \frac{1}{N}\sum_{i=1}^N \ln (1-X_i)-\, \ln \Beta(\alpha,\beta). </math> For the two parameter case, the Fisher information has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, one of these off diagonal components is independent. Therefore, the Fisher information matrix has 3 independent components (2 diagonal and 1 off diagonal). Aryal and Nadarajah<ref name=Aryal>{{cite journal|last=Aryal|first=Gokarna|author2=Saralees Nadarajah|title=Information matrix for beta distributions|journal=Serdica Mathematical Journal (Bulgarian Academy of Science)| year=2004| volume=30|pages=513–526|url=http://www.math.bas.bg/serdica/2004/2004-513-526.pdf}}</ref> calculated Fisher's information matrix for the four-parameter case, from which the two parameter case can be obtained as follows: :<math>- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \alpha^2}= \operatorname{var}[\ln (X)]= \psi_1(\alpha) - \psi_1(\alpha + \beta) ={\mathcal{I}}_{\alpha, \alpha}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \alpha^2} \right ] = \ln \operatorname{var}_{GX} </math> :<math>- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\,\partial \beta^2} = \operatorname{var}[\ln (1-X)] = \psi_1(\beta) - \psi_1(\alpha + \beta) ={\mathcal{I}}_{\beta, \beta}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\partial \beta^2} \right]= \ln \operatorname{var}_{G(1-X)} </math> :<math>- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N \, \partial \alpha \, \partial \beta} = \operatorname{cov}[\ln X,\ln(1-X)] = -\psi_1(\alpha+\beta) ={\mathcal{I}}_{\alpha, \beta}= \operatorname{E}\left [- \frac{\partial^2\ln \mathcal{L}(\alpha,\beta\mid X)}{N\,\partial \alpha\,\partial \beta} \right] = \ln \operatorname{cov}_{G{X,(1-X)}}</math> Since the Fisher information matrix is symmetric :<math> \mathcal{I}_{\alpha, \beta}= \mathcal{I}_{\beta, \alpha}= \ln \operatorname{cov}_{G{X,(1-X)}}</math> The Fisher information components are equal to the log geometric variances and log geometric covariance. Therefore, they can be expressed as '''[[trigamma function]]s''', denoted ψ<sub>1</sub>(α), the second of the [[polygamma function]]s, defined as the derivative of the [[digamma]] function: :<math>\psi_1(\alpha) = \frac{d^2\ln\Gamma(\alpha)}{\partial\alpha^2}=\, \frac{\partial \psi(\alpha)}{\partial\alpha}. </math> These derivatives are also derived in the {{section link||Two unknown parameters}} and plots of the log likelihood function are also shown in that section. {{section link||Geometric variance and covariance}} contains plots and further discussion of the Fisher information matrix components: the log geometric variances and log geometric covariance as a function of the shape parameters α and β. {{section link||Moments of logarithmically transformed random variables}} contains formulas for moments of logarithmically transformed random variables. Images for the Fisher information components <math>\mathcal{I}_{\alpha, \alpha}, \mathcal{I}_{\beta, \beta}</math> and <math>\mathcal{I}_{\alpha, \beta}</math> are shown in {{section link||Geometric variance}}. The determinant of Fisher's information matrix is of interest (for example for the calculation of [[Jeffreys prior]] probability). From the expressions for the individual components of the Fisher information matrix, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution is: :<math>\begin{align} \det(\mathcal{I}(\alpha, \beta))&= \mathcal{I}_{\alpha, \alpha} \mathcal{I}_{\beta, \beta}-\mathcal{I}_{\alpha, \beta} \mathcal{I}_{\alpha, \beta} \\[4pt] &=(\psi_1(\alpha) - \psi_1(\alpha + \beta))(\psi_1(\beta) - \psi_1(\alpha + \beta))-( -\psi_1(\alpha+\beta))( -\psi_1(\alpha+\beta))\\[4pt] &= \psi_1(\alpha)\psi_1(\beta)-( \psi_1(\alpha)+\psi_1(\beta))\psi_1(\alpha + \beta)\\[4pt] \lim_{\alpha\to 0} \det(\mathcal{I}(\alpha, \beta)) &=\lim_{\beta \to 0} \det(\mathcal{I}(\alpha, \beta)) = \infty\\[4pt] \lim_{\alpha\to \infty} \det(\mathcal{I}(\alpha, \beta)) &=\lim_{\beta \to \infty} \det(\mathcal{I}(\alpha, \beta)) = 0 \end{align}</math> From [[Sylvester's criterion]] (checking whether the diagonal elements are all positive), it follows that the Fisher information matrix for the two parameter case is [[Positive-definite matrix|positive-definite]] (under the standard condition that the shape parameters are positive ''α'' > 0 and ''β'' > 0). =====Four parameters===== [[File:Fisher Information I(a,a) for alpha=beta vs range (c-a) and exponent alpha=beta - J. Rodal.png|thumb|Fisher Information ''I''(''a'',''a'') for ''α'' = ''β'' vs range (''c'' − ''a'') and exponent ''α'' = ''β'']] [[File:Fisher Information I(alpha,a) for alpha=beta, vs. range (c - a) and exponent alpha=beta - J. Rodal.png|thumb|Fisher Information ''I''(''α'',''a'') for ''α'' = ''β'', vs. range (''c'' − ''a'') and exponent ''α'' = ''β'']] If ''Y''<sub>1</sub>, ..., ''Y<sub>N</sub>'' are independent random variables each having a beta distribution with four parameters: the exponents ''α'' and ''β'', and also ''a'' (the minimum of the distribution range), and ''c'' (the maximum of the distribution range) (section titled "Alternative parametrizations", "Four parameters"), with [[probability density function]]: :<math>f(y; \alpha, \beta, a, c) = \frac{f(x;\alpha,\beta)}{c-a} =\frac{ \left (\frac{y-a}{c-a} \right )^{\alpha-1} \left (\frac{c-y}{c-a} \right)^{\beta-1} }{(c-a)B(\alpha, \beta)}=\frac{ (y-a)^{\alpha-1} (c-y)^{\beta-1} }{(c-a)^{\alpha+\beta-1}B(\alpha, \beta)}.</math> the joint log likelihood function per ''N'' [[independent and identically distributed random variables|iid]] observations is: :<math>\frac{1}{N} \ln(\mathcal{L} (\alpha, \beta, a, c\mid Y))= \frac{\alpha -1}{N}\sum_{i=1}^N \ln (Y_i - a) + \frac{\beta -1}{N}\sum_{i=1}^N \ln (c - Y_i)- \ln \Beta(\alpha,\beta) - (\alpha+\beta -1) \ln (c-a) </math> For the four parameter case, the Fisher information has 4*4=16 components. It has 12 off-diagonal components = (4×4 total − 4 diagonal). Since the Fisher information matrix is symmetric, half of these components (12/2=6) are independent. Therefore, the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components. Aryal and Nadarajah<ref name=Aryal/> calculated Fisher's information matrix for the four parameter case as follows: :<math>- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha^2}= \operatorname{var}[\ln (X)]= \psi_1(\alpha) - \psi_1(\alpha + \beta) = \mathcal{I}_{\alpha, \alpha}= \operatorname{E}\left [- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha^2} \right ] = \ln (\operatorname{var_{GX}}) </math> :<math>-\frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \beta^2} = \operatorname{var}[\ln (1-X)] = \psi_1(\beta) - \psi_1(\alpha + \beta) ={\mathcal{I}}_{\beta, \beta}= \operatorname{E} \left [- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \beta^2} \right ] = \ln(\operatorname{var_{G(1-X)}}) </math> :<math>-\frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha\,\partial \beta} = \operatorname{cov}[\ln X,(1-X)] = -\psi_1(\alpha+\beta) =\mathcal{I}_{\alpha, \beta}= \operatorname{E} \left [- \frac{1}{N}\frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha \, \partial \beta} \right ] = \ln(\operatorname{cov}_{G{X,(1-X)}})</math> In the above expressions, the use of ''X'' instead of ''Y'' in the expressions var[ln(''X'')] = ln(var<sub>''GX''</sub>) is ''not an error''. The expressions in terms of the log geometric variances and log geometric covariance occur as functions of the two parameter ''X'' ~ Beta(''α'', ''β'') parametrization because when taking the partial derivatives with respect to the exponents (''α'', ''β'') in the four parameter case, one obtains the identical expressions as for the two parameter case: these terms of the four parameter Fisher information matrix are independent of the minimum ''a'' and maximum ''c'' of the distribution's range. The only non-zero term upon double differentiation of the log likelihood function with respect to the exponents ''α'' and ''β'' is the second derivative of the log of the beta function: ln(B(''α'', ''β'')). This term is independent of the minimum ''a'' and maximum ''c'' of the distribution's range. Double differentiation of this term results in trigamma functions. The sections titled "Maximum likelihood", "Two unknown parameters" and "Four unknown parameters" also show this fact. The Fisher information for ''N'' [[i.i.d.]] samples is ''N'' times the individual Fisher information (eq. 11.279, page 394 of Cover and Thomas<ref name="Cover and Thomas"/>). (Aryal and Nadarajah<ref name=Aryal/> take a single observation, ''N'' = 1, to calculate the following components of the Fisher information, which leads to the same result as considering the derivatives of the log likelihood per ''N'' observations. Moreover, below the erroneous expression for <math>{\mathcal{I}}_{a, a}</math> in Aryal and Nadarajah has been corrected.) :<math>\begin{align} \alpha > 2: \quad \operatorname{E}\left [- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial a^2} \right ] &= {\mathcal{I}}_{a, a}=\frac{\beta(\alpha+\beta-1)}{(\alpha-2)(c-a)^2} \\ \beta > 2: \quad \operatorname{E}\left[-\frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial c^2} \right ] &= \mathcal{I}_{c, c} = \frac{\alpha(\alpha+\beta-1)}{(\beta-2)(c-a)^2} \\ \operatorname{E}\left[- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial a \, \partial c} \right ] &= {\mathcal{I}}_{a, c} = \frac{(\alpha+\beta-1)}{(c-a)^2} \\ \alpha > 1: \quad \operatorname{E}\left[- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha \, \partial a} \right ] &=\mathcal{I}_{\alpha, a} = \frac{\beta}{(\alpha-1)(c-a)} \\ \operatorname{E}\left[- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \alpha \, \partial c} \right ] &= {\mathcal{I}}_{\alpha, c} = \frac{1}{(c-a)} \\ \operatorname{E}\left[- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \beta \,\partial a} \right ] &= {\mathcal{I}}_{\beta, a} = -\frac{1}{(c-a)} \\ \beta > 1: \quad \operatorname{E}\left[- \frac{1}{N} \frac{\partial^2\ln \mathcal{L} (\alpha, \beta, a, c\mid Y)}{\partial \beta \, \partial c} \right ] &= \mathcal{I}_{\beta, c} = -\frac{\alpha}{(\beta-1)(c-a)} \end{align}</math> The lower two diagonal entries of the Fisher information matrix, with respect to the parameter ''a'' (the minimum of the distribution's range): <math>\mathcal{I}_{a, a}</math>, and with respect to the parameter ''c'' (the maximum of the distribution's range): <math>\mathcal{I}_{c, c}</math> are only defined for exponents ''α'' > 2 and ''β'' > 2 respectively. The Fisher information matrix component <math>\mathcal{I}_{a, a}</math> for the minimum ''a'' approaches infinity for exponent α approaching 2 from above, and the Fisher information matrix component <math>\mathcal{I}_{c, c}</math> for the maximum ''c'' approaches infinity for exponent ''β'' approaching 2 from above. The Fisher information matrix for the four parameter case does not depend on the individual values of the minimum ''a'' and the maximum ''c'', but only on the total range (''c'' − ''a''). Moreover, the components of the Fisher information matrix that depend on the range (''c'' − ''a''), depend only through its inverse (or the square of the inverse), such that the Fisher information decreases for increasing range (''c'' − ''a''). The accompanying images show the Fisher information components <math>\mathcal{I}_{a, a}</math> and <math>\mathcal{I}_{\alpha, a}</math>. Images for the Fisher information components <math>\mathcal{I}_{\alpha, \alpha}</math> and <math>\mathcal{I}_{\beta, \beta}</math> are shown in {{section link||Geometric variance}}. All these Fisher information components look like a basin, with the "walls" of the basin being located at low values of the parameters. The following four-parameter-beta-distribution Fisher information components can be expressed in terms of the two-parameter: ''X'' ~ Beta(α, β) expectations of the transformed ratio ((1 − ''X'')/''X'') and of its mirror image (''X''/(1 − ''X'')), scaled by the range (''c'' − ''a''), which may be helpful for interpretation: :<math>\mathcal{I}_{\alpha, a} =\frac{\operatorname{E} \left[\frac{1-X}{X} \right ]}{c-a}= \frac{\beta}{(\alpha-1)(c-a)} \text{ if }\alpha > 1</math> :<math>\mathcal{I}_{\beta, c} = -\frac{\operatorname{E} \left [\frac{X}{1-X} \right ]}{c-a}=- \frac{\alpha}{(\beta-1)(c-a)}\text{ if }\beta> 1</math> These are also the expected values of the "inverted beta distribution" or [[beta prime distribution]] (also known as beta distribution of the second kind or [[Pearson distribution|Pearson's Type VI]]) <ref name=JKB/> and its mirror image, scaled by the range (''c'' − ''a''). Also, the following Fisher information components can be expressed in terms of the harmonic (1/X) variances or of variances based on the ratio transformed variables ((1-X)/X) as follows: :<math>\begin{align} \alpha > 2: \quad \mathcal{I}_{a,a} &=\operatorname{var} \left [\frac{1}{X} \right] \left (\frac{\alpha-1}{c-a} \right )^2 =\operatorname{var} \left [\frac{1-X}{X} \right ] \left (\frac{\alpha-1}{c-a} \right)^2 = \frac{\beta(\alpha+\beta-1)}{(\alpha-2)(c-a)^2} \\ \beta > 2: \quad \mathcal{I}_{c, c} &= \operatorname{var} \left [\frac{1}{1-X} \right ] \left (\frac{\beta-1}{c-a} \right )^2 = \operatorname{var} \left [\frac{X}{1-X} \right ] \left (\frac{\beta-1}{c-a} \right )^2 =\frac{\alpha(\alpha+\beta-1)}{(\beta-2)(c-a)^2} \\ \mathcal{I}_{a, c} &=\operatorname{cov} \left [\frac{1}{X},\frac{1}{1-X} \right ]\frac{(\alpha-1)(\beta-1)}{(c-a)^2} = \operatorname{cov} \left [\frac{1-X}{X},\frac{X}{1-X} \right ] \frac{(\alpha-1)(\beta-1)}{(c-a)^2} =\frac{(\alpha+\beta-1)}{(c-a)^2} \end{align}</math> See section "Moments of linearly transformed, product and inverted random variables" for these expectations. The determinant of Fisher's information matrix is of interest (for example for the calculation of [[Jeffreys prior]] probability). From the expressions for the individual components, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution with four parameters is: :<math>\begin{align} \det(\mathcal{I}(\alpha,\beta,a,c)) = {} & -\mathcal{I}_{a,c}^2 \mathcal{I}_{\alpha,a} \mathcal{I}_{\alpha,\beta }+\mathcal{I}_{a,a} \mathcal{I}_{a,c} \mathcal{I}_{\alpha,c} \mathcal{I}_{\alpha ,\beta}+\mathcal{I}_{a,c}^2 \mathcal{I}_{\alpha ,\beta}^2 -\mathcal{I}_{a,a} \mathcal{I}_{c,c} \mathcal{I}_{\alpha,\beta}^2\\ & {} -\mathcal{I}_{a,c} \mathcal{I}_{\alpha,a} \mathcal{I}_{\alpha ,c} \mathcal{I}_{\beta,a}+\mathcal{I}_{a,c}^2 \mathcal{I}_{\alpha ,\alpha} \mathcal{I}_{\beta,a}+2 \mathcal{I}_{c,c} \mathcal{I}_{\alpha,a} \mathcal{I}_{\alpha,\beta} \mathcal{I}_{\beta,a}\\ & {}-2\mathcal{I}_{a,c} \mathcal{I}_{\alpha ,c} \mathcal{I}_{\alpha,\beta} \mathcal{I}_{\beta ,a}+\mathcal{I}_{\alpha ,c}^2 \mathcal{I}_{\beta ,a}^2-\mathcal{I}_{c,c} \mathcal{I}_{\alpha,\alpha} \mathcal{I}_{\beta ,a}^2+\mathcal{I}_{a,c} \mathcal{I}_{\alpha ,a}^2 \mathcal{I}_{\beta ,c}\\ & {}-\mathcal{I}_{a,a} \mathcal{I}_{a,c} \mathcal{I}_{\alpha ,\alpha } \mathcal{I}_{\beta ,c}-\mathcal{I}_{a,c} \mathcal{I}_{\alpha ,a} \mathcal{I}_{\alpha ,\beta } \mathcal{I}_{\beta ,c}+\mathcal{I}_{a,a} \mathcal{I}_{\alpha ,c} \mathcal{I}_{\alpha ,\beta } \mathcal{I}_{\beta ,c}\\ & {}-\mathcal{I}_{\alpha ,a} \mathcal{I}_{\alpha ,c} \mathcal{I}_{\beta ,a} \mathcal{I}_{\beta ,c}+\mathcal{I}_{a,c} \mathcal{I}_{\alpha ,\alpha } \mathcal{I}_{\beta ,a} \mathcal{I}_{\beta ,c}-\mathcal{I}_{c,c} \mathcal{I}_{\alpha ,a}^2 \mathcal{I}_{\beta ,\beta }\\ & {}+2 \mathcal{I}_{a,c} \mathcal{I}_{\alpha ,a} \mathcal{I}_{\alpha, c} \mathcal{I}_{\beta ,\beta }-\mathcal{I}_{a,a} \mathcal{I}_{\alpha ,c}^2 \mathcal{I}_{\beta ,\beta }-\mathcal{I}_{a,c}^2 \mathcal{I}_{\alpha ,\alpha } \mathcal{I}_{\beta ,\beta }+\mathcal{I}_{a,a} \mathcal{I}_{c,c} \mathcal{I}_{\alpha ,\alpha } \mathcal{I}_{\beta ,\beta }\text{ if }\alpha, \beta> 2 \end{align}</math> Using [[Sylvester's criterion]] (checking whether the diagonal elements are all positive), and since diagonal components <math>{\mathcal{I}}_{a, a}</math> and <math>{\mathcal{I}}_{c, c}</math> have [[Mathematical singularity|singularities]] at α=2 and β=2 it follows that the Fisher information matrix for the four parameter case is [[Positive-definite matrix|positive-definite]] for α>2 and β>2. Since for α > 2 and β > 2 the beta distribution is (symmetric or unsymmetric) bell shaped, it follows that the Fisher information matrix is positive-definite only for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. Thus, important well known distributions belonging to the four-parameter beta distribution family, like the parabolic distribution (Beta(2,2,a,c)) and the [[continuous uniform distribution|uniform distribution]] (Beta(1,1,a,c)) have Fisher information components (<math>\mathcal{I}_{a, a},\mathcal{I}_{c, c},\mathcal{I}_{\alpha, a},\mathcal{I}_{\beta, c}</math>) that blow up (approach infinity) in the four-parameter case (although their Fisher information components are all defined for the two parameter case). The four-parameter [[Wigner semicircle distribution]] (Beta(3/2,3/2,''a'',''c'')) and [[arcsine distribution]] (Beta(1/2,1/2,''a'',''c'')) have negative Fisher information determinants for the four-parameter case. ===Bayesian inference=== {{Main|Bayesian inference}} [[File:Beta(1,1) Uniform distribution - J. Rodal.png|thumb|<math>Beta(1,1)</math>: The [[uniform distribution (continuous)|uniform distribution]] probability density was proposed by [[Thomas Bayes]] to represent ignorance of prior probabilities in [[Bayesian inference]].]] The use of Beta distributions in [[Bayesian inference]] is due to the fact that they provide a family of [[conjugate prior distribution|conjugate prior probability distribution]]s for [[binomial distribution|binomial]] (including [[Bernoulli distribution|Bernoulli]]) and [[geometric distribution]]s. The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to describe the distribution of a probability value ''p'':<ref name=MacKay/> :<math>P(p;\alpha,\beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{\Beta(\alpha,\beta)}.</math> Examples of beta distributions used as prior probabilities to represent ignorance of prior parameter values in Bayesian inference are Beta(1,1), Beta(0,0) and Beta(1/2,1/2). ====Rule of succession==== {{Main|Rule of succession}} A classic application of the beta distribution is the [[rule of succession]], introduced in the 18th century by [[Pierre-Simon Laplace]]<ref name=Laplace>{{cite book|last=Laplace|first=Pierre Simon, marquis de|title=A philosophical essay on probabilities|year=1902|publisher=New York : J. Wiley; London : Chapman & Hall|isbn=978-1-60206-328-0|url=https://archive.org/details/philosophicaless00lapliala}}</ref> in the course of treating the [[sunrise problem]]. It states that, given ''s'' successes in ''n'' [[conditional independence|conditionally independent]] [[Bernoulli trial]]s with probability ''p,'' that the estimate of the expected value in the next trial is <math>\frac{s+1}{n+2}</math>. This estimate is the expected value of the posterior distribution over ''p,'' namely Beta(''s''+1, ''n''−''s''+1), which is given by [[Bayes' rule]] if one assumes a uniform prior probability over ''p'' (i.e., Beta(1, 1)) and then observes that ''p'' generated ''s'' successes in ''n'' trials. Laplace's rule of succession has been criticized by prominent scientists. R. T. Cox described Laplace's application of the rule of succession to the [[sunrise problem]] (<ref name=CoxRT>{{cite book|last=Cox|first=Richard T.|title=Algebra of Probable Inference|year=1961|publisher=The Johns Hopkins University Press|isbn=978-0801869822}}</ref> p. 89) as "a travesty of the proper use of the principle". Keynes remarks (<ref name=KeynesTreatise>{{cite book|last=Keynes|first=John Maynard|title=A Treatise on Probability: The Connection Between Philosophy and the History of Science|orig-year=1921|year=2010|publisher=Wildside Press|isbn=978-1434406965}}</ref> Ch.XXX, p. 382) "indeed this is so foolish a theorem that to entertain it is discreditable". Karl Pearson<ref name=PearsonRuleSuccession>{{cite journal|last=Pearson|first=Karl|title=On the Influence of Past Experience on Future Expectation|journal=Philosophical Magazine|year=1907|volume=6|issue=13|pages=365–378}}</ref> showed that the probability that the next (''n'' + 1) trials will be successes, after n successes in n trials, is only 50%, which has been considered too low by scientists like Jeffreys and unacceptable as a representation of the scientific process of experimentation to test a proposed scientific law. As pointed out by Jeffreys (<ref name=Jeffreys/> p. 128) (crediting [[C. D. Broad]]<ref name=BroadMind>{{cite journal|last=Broad|first=C. D.|title=On the relation between induction and probability|journal=MIND, A Quarterly Review of Psychology and Philosophy|date=October 1918|volume=27 (New Series)|issue=108|pages=389–404|jstor=2249035|doi=10.1093/mind/XXVII.4.389}}</ref> ) Laplace's rule of succession establishes a high probability of success ((n+1)/(n+2)) in the next trial, but only a moderate probability (50%) that a further sample (n+1) comparable in size will be equally successful. As pointed out by Perks,<ref name=Perks>{{cite journal|last=Perks|first=Wilfred|title=Some observations on inverse probability including a new indifference rule|journal=Journal of the Institute of Actuaries|date=January 1947|volume=73|issue=2|pages=285–334|url=http://www.actuaries.org.uk/research-and-resources/documents/some-observations-inverse-probability-including-new-indifference-ru|doi=10.1017/S0020268100012270|access-date=2012-09-19|archive-date=2014-01-12|archive-url=https://web.archive.org/web/20140112111032/http://www.actuaries.org.uk/research-and-resources/documents/some-observations-inverse-probability-including-new-indifference-ru|url-status=dead}}</ref> "The rule of succession itself is hard to accept. It assigns a probability to the next trial which implies the assumption that the actual run observed is an average run and that we are always at the end of an average run. It would, one would think, be more reasonable to assume that we were in the middle of an average run. Clearly a higher value for both probabilities is necessary if they are to accord with reasonable belief." These problems with Laplace's rule of succession motivated Haldane, Perks, Jeffreys and others to search for other forms of prior probability (see the next {{section link||Bayesian inference}}). According to Jaynes,<ref name=Jaynes/> the main problem with the rule of succession is that it is not valid when s=0 or s=n (see [[rule of succession]], for an analysis of its validity). ====Bayes–Laplace prior probability (Beta(1,1))==== The beta distribution achieves maximum differential entropy for Beta(1,1): the [[Uniform density|uniform]] probability density, for which all values in the domain of the distribution have equal density. This uniform distribution Beta(1,1) was suggested ("with a great deal of doubt") by [[Thomas Bayes]]<ref name="ThomasBayes"/> as the prior probability distribution to express ignorance about the correct prior distribution. This prior distribution was adopted (apparently, from his writings, with little sign of doubt<ref name=Laplace/>) by [[Pierre-Simon Laplace]], and hence it was also known as the "Bayes–Laplace rule" or the "Laplace rule" of "[[inverse probability]]" in publications of the first half of the 20th century. In the later part of the 19th century and early part of the 20th century, scientists realized that the assumption of uniform "equal" probability density depended on the actual functions (for example whether a linear or a logarithmic scale was most appropriate) and parametrizations used. In particular, the behavior near the ends of distributions with finite support (for example near ''x'' = 0, for a distribution with initial support at ''x'' = 0) required particular attention. Keynes (<ref name=KeynesTreatise/> Ch.XXX, p. 381) criticized the use of Bayes's uniform prior probability (Beta(1,1)) that all values between zero and one are equiprobable, as follows: "Thus experience, if it shows anything, shows that there is a very marked clustering of statistical ratios in the neighborhoods of zero and unity, of those for positive theories and for correlations between positive qualities in the neighborhood of zero, and of those for negative theories and for correlations between negative qualities in the neighborhood of unity. " ===={{Anchor|Haldane prior}}Haldane's prior probability (Beta(0,0))==== [[File:Beta distribution for alpha and beta approaching zero - J. Rodal.png|thumb|<math>Beta(0,0)</math>: The Haldane prior probability expressing total ignorance about prior information, where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure. As α, β → 0, the beta distribution approaches a two-point [[Bernoulli distribution]] with all probability density concentrated at each end, at 0 and 1, and nothing in between. A coin-toss: one face of the coin being at 0 and the other face being at 1. ]] The Beta(0,0) distribution was proposed by [[J.B.S. Haldane]],<ref>{{cite journal|last=Haldane |first=J. B. S.| authorlink1=J. B. S. Haldane |title=A note on inverse probability|journal=[[Mathematical Proceedings of the Cambridge Philosophical Society]]|year=1932|volume=28|issue=1|pages=55–61|doi=10.1017/s0305004100010495|bibcode=1932PCPS...28...55H|s2cid=122773707 }}</ref> who suggested that the prior probability representing complete uncertainty should be proportional to ''p''<sup>−1</sup>(1−''p'')<sup>−1</sup>. The function ''p''<sup>−1</sup>(1−''p'')<sup>−1</sup> can be viewed as the limit of the numerator of the beta distribution as both shape parameters approach zero: α, β → 0. The Beta function (in the denominator of the beta distribution) approaches infinity, for both parameters approaching zero, α, β → 0. Therefore, ''p''<sup>−1</sup>(1−''p'')<sup>−1</sup> divided by the Beta function approaches a 2-point [[Bernoulli distribution]] with equal probability 1/2 at each end, at 0 and 1, and nothing in between, as α, β → 0. A coin-toss: one face of the coin being at 0 and the other face being at 1. The Haldane prior probability distribution Beta(0,0) is an "[[improper prior]]" because its integration (from 0 to 1) fails to strictly converge to 1 due to the singularities at each end. However, this is not an issue for computing posterior probabilities unless the sample size is very small. Furthermore, Zellner<ref name=Zellner>{{cite book|last=Zellner |first=Arnold|title=An Introduction to Bayesian Inference in Econometrics|year=1971|publisher=Wiley-Interscience|isbn=978-0471169376}}</ref> points out that on the [[log-odds]] scale, (the [[logit]] transformation <math>\log(p/(1-p))</math>), the Haldane prior is the uniformly flat prior. The fact that a uniform prior probability on the [[logit]] transformed variable ln(''p''/1 − ''p'') (with domain (−∞, ∞)) is equivalent to the Haldane prior on the domain [0, 1] was pointed out by [[Harold Jeffreys]] in the first edition (1939) of his book Theory of Probability (<ref name=Jeffreys/> p. 123). Jeffreys writes "Certainly if we take the Bayes–Laplace rule right up to the extremes we are led to results that do not correspond to anybody's way of thinking. The (Haldane) rule d''x''/(''x''(1 − ''x'')) goes too far the other way. It would lead to the conclusion that if a sample is of one type with respect to some property there is a probability 1 that the whole population is of that type." The fact that "uniform" depends on the parametrization, led Jeffreys to seek a form of prior that would be invariant under different parametrizations. ====Jeffreys' prior probability (Beta(1/2,1/2) for a Bernoulli or for a binomial distribution)==== {{Main|Jeffreys prior}} [[File:Jeffreys prior probability for the beta distribution - J. Rodal.png|thumb|[[Jeffreys prior]] probability for the beta distribution: the square root of the determinant of [[Fisher's information]] matrix: <math>\scriptstyle\sqrt{\det(\mathcal{I}(\alpha, \beta))} = \sqrt{\psi_1(\alpha)\psi_1(\beta)-( \psi_1(\alpha)+\psi_1(\beta) )\psi_1(\alpha + \beta)}</math> is a function of the [[trigamma function]] ψ<sub>1</sub> of shape parameters α, β]] [[File:Beta distribution for 3 different prior probability functions - J. Rodal.png|thumb|Posterior Beta densities with samples having success = "s", failure = "f" of ''s''/(''s'' + ''f'') = 1/2, and ''s'' + ''f'' = {3,10,50}, based on 3 different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near ''p'' = 1/2). Significant differences appear for very small sample sizes (the flatter distribution for sample size of 3)]] [[File:Beta distribution for 3 different prior probability functions, skewed case - J. Rodal.png|thumb|Posterior Beta densities with samples having success = "s", failure = "f" of ''s''/(''s'' + ''f'') = 1/4, and ''s'' + ''f'' ∈ {3,10,50}, based on three different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near ''p'' = 1/4). Significant differences appear for very small sample sizes (the very skewed distribution for the degenerate case of sample size = 3, in this degenerate and unlikely case the Haldane prior results in a reverse "J" shape with mode at ''p'' = 0 instead of ''p'' = 1/4. If there is sufficient [[Sample (statistics)|sampling data]], the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar [[posterior probability|''posterior'' probability]] densities.]] [[File:Beta distribution for 3 different prior probability functions, skewed case sample size = (4,12,40) - J. Rodal.png|thumb|Posterior Beta densities with samples having success = ''s'', failure = ''f'' of ''s''/(''s'' + ''f'') = 1/4, and ''s'' + ''f'' ∈ {4,12,40}, based on three different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 40 (with more pronounced peak near ''p'' = 1/4). Significant differences appear for very small sample sizes]] [[Harold Jeffreys]]<ref name=Jeffreys>{{cite book|last=Jeffreys|first=Harold|title=Theory of Probability|year=1998|publisher=Oxford University Press, 3rd edition|isbn=978-0198503682}}</ref><ref name=JeffreysPRIOR>{{cite journal|last=Jeffreys|first=Harold|title=An Invariant Form for the Prior Probability in Estimation Problems|journal=Proceedings of the Royal Society|date=September 1946|volume=186|series=A 24|issue=1007|pages=453–461|doi=10.1098/rspa.1946.0056|pmid=20998741|bibcode=1946RSPSA.186..453J|doi-access=free}}</ref> proposed to use an [[uninformative prior]] probability measure that should be [[Parametrization invariance|invariant under reparameterization]]: proportional to the square root of the [[determinant]] of [[Fisher's information]] matrix. For the [[Bernoulli distribution]], this can be shown as follows: for a coin that is "heads" with probability ''p'' ∈ [0, 1] and is "tails" with probability 1 − ''p'', for a given (H,T) ∈ {(0,1), (1,0)} the probability is ''p<sup>H</sup>''(1 − ''p'')<sup>''T''</sup>. Since ''T'' = 1 − ''H'', the [[Bernoulli distribution]] is ''p<sup>H</sup>''(1 − ''p'')<sup>1 − ''H''</sup>. Considering ''p'' as the only parameter, it follows that the log likelihood for the Bernoulli distribution is :<math>\ln \mathcal{L} (p\mid H) = H \ln(p)+ (1-H) \ln(1-p).</math> The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: ''p''), therefore: :<math>\begin{align} \sqrt{\mathcal{I}(p)} &= \sqrt{\operatorname{E}\!\left[ \left( \frac{d}{dp} \ln(\mathcal{L} (p\mid H)) \right)^2\right]} \\[6pt] &= \sqrt{\operatorname{E}\!\left[ \left( \frac{H}{p} - \frac{1-H}{1-p}\right)^2 \right]} \\[6pt] &= \sqrt{p^1 (1-p)^0 \left( \frac{1}{p} - \frac{0}{1-p}\right)^2 + p^0 (1-p)^1 \left(\frac{0}{p} - \frac{1}{1-p}\right)^2} \\ &= \frac{1}{\sqrt{p(1-p)}}. \end{align}</math> Similarly, for the [[Binomial distribution]] with ''n'' [[Bernoulli trials]], it can be shown that :<math>\sqrt{\mathcal{I}(p)}= \frac{\sqrt{n}}{\sqrt{p(1-p)}}.</math> Thus, for the [[Bernoulli distribution|Bernoulli]], and [[Binomial distribution]]s, [[Jeffreys prior]] is proportional to <math>\scriptstyle \frac{1}{\sqrt{p(1-p)}}</math>, which happens to be proportional to a beta distribution with domain variable ''x'' = ''p'', and shape parameters α = β = 1/2, the [[arcsine distribution]]: :<math>Beta(\tfrac{1}{2}, \tfrac{1}{2}) = \frac{1}{\pi \sqrt{p(1-p)}}.</math> It will be shown in the next section that the normalizing constant for Jeffreys prior is immaterial to the final result because the normalizing constant cancels out in Bayes' theorem for the posterior probability. Hence Beta(1/2,1/2) is used as the Jeffreys prior for both Bernoulli and binomial distributions. As shown in the next section, when using this expression as a prior probability times the likelihood in [[Bayes' theorem]], the posterior probability turns out to be a beta distribution. It is important to realize, however, that Jeffreys prior is proportional to <math>\scriptstyle \frac{1}{\sqrt{p(1-p)}}</math> for the Bernoulli and binomial distribution, but not for the beta distribution. Jeffreys prior for the beta distribution is given by the determinant of Fisher's information for the beta distribution, which, as shown in the {{section link||Fisher information matrix}} is a function of the [[trigamma function]] ψ<sub>1</sub> of shape parameters α and β as follows: :<math> \begin{align} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &= \sqrt{\psi_1(\alpha)\psi_1(\beta)-(\psi_1(\alpha)+\psi_1(\beta))\psi_1(\alpha + \beta)} \\ \lim_{\alpha\to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to 0} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = \infty\\ \lim_{\alpha\to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} &=\lim_{\beta \to \infty} \sqrt{\det(\mathcal{I}(\alpha, \beta))} = 0 \end{align}</math> As previously discussed, Jeffreys prior for the Bernoulli and binomial distributions is proportional to the [[arcsine distribution]] Beta(1/2,1/2), a one-dimensional ''curve'' that looks like a basin as a function of the parameter ''p'' of the Bernoulli and binomial distributions. The walls of the basin are formed by ''p'' approaching the singularities at the ends ''p'' → 0 and ''p'' → 1, where Beta(1/2,1/2) approaches infinity. Jeffreys prior for the beta distribution is a ''2-dimensional surface'' (embedded in a three-dimensional space) that looks like a basin with only two of its walls meeting at the corner α = β = 0 (and missing the other two walls) as a function of the shape parameters α and β of the beta distribution. The two adjoining walls of this 2-dimensional surface are formed by the shape parameters α and β approaching the singularities (of the trigamma function) at α, β → 0. It has no walls for α, β → ∞ because in this case the determinant of Fisher's information matrix for the beta distribution approaches zero. It will be shown in the next section that Jeffreys prior probability results in posterior probabilities (when multiplied by the binomial likelihood function) that are intermediate between the posterior probability results of the Haldane and Bayes prior probabilities. Jeffreys prior may be difficult to obtain analytically, and for some cases it just doesn't exist (even for simple distribution functions like the asymmetric [[triangular distribution]]). Berger, Bernardo and Sun, in a 2009 paper<ref name="BergerBernardoSun">{{cite journal|last=Berger|first=James |author2=Bernardo, Jose |author3=Sun, Dongchu|title=The formal definition of reference priors|journal=The Annals of Statistics|year=2009|volume=37|issue=2|pages=905–938|doi=10.1214/07-AOS587|url= http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.aos/1236693154|arxiv=0904.0156|bibcode=2009arXiv0904.0156B |s2cid=3221355 }}</ref> defined a reference prior probability distribution that (unlike Jeffreys prior) exists for the asymmetric [[triangular distribution]]. They cannot obtain a closed-form expression for their reference prior, but numerical calculations show it to be nearly perfectly fitted by the (proper) prior :<math> \operatorname{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) \sim\frac{1}{\sqrt{\theta(1-\theta)}}</math> where θ is the vertex variable for the asymmetric triangular distribution with support [0, 1] (corresponding to the following parameter values in Wikipedia's article on the [[triangular distribution]]: vertex ''c'' = ''θ'', left end ''a'' = 0,and right end ''b'' = 1). Berger et al. also give a heuristic argument that Beta(1/2,1/2) could indeed be the exact Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution. Therefore, Beta(1/2,1/2) not only is Jeffreys prior for the Bernoulli and binomial distributions, but also seems to be the Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution (for which the Jeffreys prior does not exist), a distribution used in project management and [[PERT]] analysis to describe the cost and duration of project tasks. Clarke and Barron<ref>{{cite journal|last=Clarke|first=Bertrand S.|author2=Andrew R. Barron|title=Jeffreys' prior is asymptotically least favorable under entropy risk|journal=Journal of Statistical Planning and Inference|year=1994|volume=41|pages=37–60|url=http://www.stat.yale.edu/~arb4/publications_files/jeffery's%20prior.pdf|doi=10.1016/0378-3758(94)90153-8}}</ref> prove that, among continuous positive priors, Jeffreys prior (when it exists) asymptotically maximizes Shannon's [[mutual information]] between a sample of size n and the parameter, and therefore ''Jeffreys prior is the most uninformative prior'' (measuring information as Shannon information). The proof rests on an examination of the [[Kullback–Leibler divergence]] between probability density functions for [[iid]] random variables. ====Effect of different prior probability choices on the posterior beta distribution==== If samples are drawn from the population of a random variable ''X'' that result in ''s'' successes and ''f'' failures in ''n'' [[Bernoulli trial]]s ''n'' = ''s'' + ''f'', then the [[likelihood function]] for parameters ''s'' and ''f'' given ''x'' = ''p'' (the notation ''x'' = ''p'' in the expressions below will emphasize that the domain ''x'' stands for the value of the parameter ''p'' in the binomial distribution), is the following [[binomial distribution]]: :<math>\mathcal{L}(s,f\mid x=p) = {s+f \choose s} x^s(1-x)^f = {n \choose s} x^s(1-x)^{n - s}. </math> If beliefs about [[prior probability]] information are reasonably well approximated by a beta distribution with parameters ''α'' Prior and ''β'' Prior, then: :<math>{\operatorname{PriorProbability}}(x=p;\alpha \operatorname{Prior},\beta \operatorname{Prior}) = \frac{ x^{\alpha \operatorname{Prior}-1}(1-x)^{\beta \operatorname{Prior}-1}}{\Beta(\alpha \operatorname{Prior},\beta \operatorname{Prior})}</math> According to [[Bayes' theorem]] for a continuous event space, the [[posterior probability]] density is given by the product of the [[prior probability]] and the likelihood function (given the evidence ''s'' and ''f'' = ''n'' − ''s''), normalized so that the area under the curve equals one, as follows: :<math>\begin{align} & \text{posterior probability density}(x=p\mid s,n-s) \\[6pt] = {} & \frac{\operatorname{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p)} {\int_0^1\text{prior probability density}(x=p;\alpha \operatorname{prior},\beta \operatorname{prior}) \mathcal{L}(s,f\mid x=p) \, dx} \\[6pt] = {} & \frac{{{n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} / \Beta(\alpha \operatorname{prior},\beta \operatorname{prior})}}{\int_0^1 \left({n \choose s} x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1} /\Beta(\alpha \operatorname{prior}, \beta \operatorname{prior})\right) \, dx} \\[6pt] = {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\int_0^1 \left(x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}\right) \, dx} \\[6pt] = {} & \frac{x^{s+\alpha \operatorname{prior}-1}(1-x)^{n-s+\beta \operatorname{prior}-1}}{\Beta(s+\alpha \operatorname{prior},n-s+\beta \operatorname{prior})}. \end{align}</math> The [[binomial coefficient]] :<math>{s+f \choose s}={n \choose s}=\frac{(s+f)!}{s! f!}=\frac{n!}{s!(n-s)!}</math> appears both in the numerator and the denominator of the posterior probability, and it does not depend on the integration variable ''x'', hence it cancels out, and it is irrelevant to the final result. Similarly the normalizing factor for the prior probability, the beta function B(αPrior,βPrior) cancels out and it is immaterial to the final result. The same posterior probability result can be obtained if one uses an un-normalized prior :<math>x^{\alpha \operatorname{prior}-1}(1-x)^{\beta \operatorname{prior}-1}</math> because the normalizing factors all cancel out. Several authors (including Jeffreys himself) thus use an un-normalized prior formula since the normalization constant cancels out. The numerator of the posterior probability ends up being just the (un-normalized) product of the prior probability and the likelihood function, and the denominator is its integral from zero to one. The beta function in the denominator, B(''s'' + ''α'' Prior, ''n'' − ''s'' + ''β'' Prior), appears as a normalization constant to ensure that the total posterior probability integrates to unity. The ratio ''s''/''n'' of the number of successes to the total number of trials is a [[sufficient statistic]] in the binomial case, which is relevant for the following results. For the '''Bayes'''' prior probability (Beta(1,1)), the posterior probability is: :<math>\operatorname{posterior probability}(p=x\mid s,f) = \frac{x^{s}(1-x)^{n-s}}{\Beta(s+1,n-s+1)}, \text{ with mean }=\frac{s+1}{n+2},\text{ (and mode}=\frac{s}{n}\text{ if } 0 < s < n).</math> For the '''Jeffreys'''' prior probability (Beta(1/2,1/2)), the posterior probability is: :<math>\operatorname{posterior probability}(p=x\mid s,f) = {x^{s-\tfrac{1}{2}}(1-x)^{n-s-\frac{1}{2}} \over \Beta(s+\tfrac{1}{2},n-s+\tfrac{1}{2})} ,\text{ with mean} = \frac{s+\tfrac{1}{2}}{n+1},\text{ (and mode}=\frac{s-\tfrac{1}{2}}{n-1}\text{ if } \tfrac{1}{2} < s < n-\tfrac{1}{2}).</math> and for the '''Haldane''' prior probability (Beta(0,0)), the posterior probability is: :<math>\operatorname{posterior probability}(p=x\mid s,f) = \frac{x^{s-1}(1-x)^{n-s-1}}{\Beta(s,n-s)}, \text{ with mean} = \frac{s}{n},\text{ (and mode}=\frac{s-1}{n-2}\text{ if } 1 < s < n -1).</math> From the above expressions it follows that for ''s''/''n'' = 1/2) all the above three prior probabilities result in the identical location for the posterior probability mean = mode = 1/2. For ''s''/''n'' < 1/2, the mean of the posterior probabilities, using the following priors, are such that: mean for Bayes prior > mean for Jeffreys prior > mean for Haldane prior. For ''s''/''n'' > 1/2 the order of these inequalities is reversed such that the Haldane prior probability results in the largest posterior mean. The ''Haldane'' prior probability Beta(0,0) results in a posterior probability density with ''mean'' (the expected value for the probability of success in the "next" trial) identical to the ratio ''s''/''n'' of the number of successes to the total number of trials. Therefore, the Haldane prior results in a posterior probability with expected value in the next trial equal to the maximum likelihood. The ''Bayes'' prior probability Beta(1,1) results in a posterior probability density with ''mode'' identical to the ratio ''s''/''n'' (the maximum likelihood). In the case that 100% of the trials have been successful ''s'' = ''n'', the ''Bayes'' prior probability Beta(1,1) results in a posterior expected value equal to the rule of succession (''n'' + 1)/(''n'' + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of 1 (absolute certainty of success in the next trial). Jeffreys prior probability results in a posterior expected value equal to (''n'' + 1/2)/(''n'' + 1). Perks<ref name=Perks/> (p. 303) points out: "This provides a new rule of succession and expresses a 'reasonable' position to take up, namely, that after an unbroken run of n successes we assume a probability for the next trial equivalent to the assumption that we are about half-way through an average run, i.e. that we expect a failure once in (2''n'' + 2) trials. The Bayes–Laplace rule implies that we are about at the end of an average run or that we expect a failure once in (''n'' + 2) trials. The comparison clearly favours the new result (what is now called Jeffreys prior) from the point of view of 'reasonableness'." Conversely, in the case that 100% of the trials have resulted in failure (''s'' = 0), the ''Bayes'' prior probability Beta(1,1) results in a posterior expected value for success in the next trial equal to 1/(''n'' + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of success in the next trial of 0 (absolute certainty of failure in the next trial). Jeffreys prior probability results in a posterior expected value for success in the next trial equal to (1/2)/(''n'' + 1), which Perks<ref name=Perks/> (p. 303) points out: "is a much more reasonably remote result than the Bayes–Laplace result 1/(''n'' + 2)". Jaynes<ref name=Jaynes/> questions (for the uniform prior Beta(1,1)) the use of these formulas for the cases ''s'' = 0 or ''s'' = ''n'' because the integrals do not converge (Beta(1,1) is an improper prior for ''s'' = 0 or ''s'' = ''n''). In practice, the conditions 0<s<n necessary for a mode to exist between both ends for the Bayes prior are usually met, and therefore the Bayes prior (as long as 0 < ''s'' < ''n'') results in a posterior mode located between both ends of the domain. As remarked in the section on the rule of succession, K. Pearson showed that after ''n'' successes in ''n'' trials the posterior probability (based on the Bayes Beta(1,1) distribution as the prior probability) that the next (''n'' + 1) trials will all be successes is exactly 1/2, whatever the value of ''n''. Based on the Haldane Beta(0,0) distribution as the prior probability, this posterior probability is 1 (absolute certainty that after n successes in ''n'' trials the next (''n'' + 1) trials will all be successes). Perks<ref name=Perks/> (p. 303) shows that, for what is now known as the Jeffreys prior, this probability is ((''n'' + 1/2)/(''n'' + 1))((''n'' + 3/2)/(''n'' + 2))...(2''n'' + 1/2)/(2''n'' + 1), which for ''n'' = 1, 2, 3 gives 15/24, 315/480, 9009/13440; rapidly approaching a limiting value of <math>1/\sqrt{2} = 0.70710678\ldots</math> as n tends to infinity. Perks remarks that what is now known as the Jeffreys prior: "is clearly more 'reasonable' than either the Bayes–Laplace result or the result on the (Haldane) alternative rule rejected by Jeffreys which gives certainty as the probability. It clearly provides a very much better correspondence with the process of induction. Whether it is 'absolutely' reasonable for the purpose, i.e. whether it is yet large enough, without the absurdity of reaching unity, is a matter for others to decide. But it must be realized that the result depends on the assumption of complete indifference and absence of knowledge prior to the sampling experiment." Following are the variances of the posterior distribution obtained with these three prior probability distributions: for the '''Bayes'''' prior probability (Beta(1,1)), the posterior variance is: :<math>\text{variance} = \frac{(n-s+1)(s+1)}{(3+n)(2+n)^2},\text{ which for } s=\frac{n}{2} \text{ results in variance} =\frac{1}{12+4n}</math> for the '''Jeffreys'''' prior probability (Beta(1/2,1/2)), the posterior variance is: : <math>\text{variance} = \frac{(n-s+\frac{1}{2})(s+\frac{1}{2})}{(2+n)(1+n)^2} ,\text{ which for } s=\frac n 2 \text{ results in var} = \frac 1 {8 + 4n}</math> and for the '''Haldane''' prior probability (Beta(0,0)), the posterior variance is: :<math>\text{variance} = \frac{(n-s)s}{(1+n)n^2}, \text{ which for }s=\frac{n}{2}\text{ results in variance} =\frac{1}{4+4n}</math> So, as remarked by Silvey,<ref name=Silvey/> for large ''n'', the variance is small and hence the posterior distribution is highly concentrated, whereas the assumed prior distribution was very diffuse. This is in accord with what one would hope for, as vague prior knowledge is transformed (through Bayes' theorem) into a more precise posterior knowledge by an informative experiment. For small ''n'' the Haldane Beta(0,0) prior results in the largest posterior variance while the Bayes Beta(1,1) prior results in the more concentrated posterior. Jeffreys prior Beta(1/2,1/2) results in a posterior variance in between the other two. As ''n'' increases, the variance rapidly decreases so that the posterior variance for all three priors converges to approximately the same value (approaching zero variance as ''n'' → ∞). Recalling the previous result that the ''Haldane'' prior probability Beta(0,0) results in a posterior probability density with ''mean'' (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials, it follows from the above expression that also the ''Haldane'' prior Beta(0,0) results in a posterior with ''variance'' identical to the variance expressed in terms of the max. likelihood estimate s/n and sample size (in {{section link||Variance}}): :<math>\text{variance} = \frac{\mu(1-\mu)}{1 + \nu}= \frac{(n-s)s}{(1+n) n^2} </math> with the mean ''μ'' = ''s''/''n'' and the sample size ''ν'' = ''n''. In Bayesian inference, using a [[prior distribution]] Beta(''α''Prior,''β''Prior) prior to a binomial distribution is equivalent to adding (''α''Prior − 1) pseudo-observations of "success" and (''β''Prior − 1) pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter ''p'' of the binomial distribution by the proportion of successes over both real- and pseudo-observations. A uniform prior Beta(1,1) does not add (or subtract) any pseudo-observations since for Beta(1,1) it follows that (''α''Prior − 1) = 0 and (''β''Prior − 1) = 0. The Haldane prior Beta(0,0) subtracts one pseudo observation from each and Jeffreys prior Beta(1/2,1/2) subtracts 1/2 pseudo-observation of success and an equal number of failure. This subtraction has the effect of [[smoothing]] out the posterior distribution. If the proportion of successes is not 50% (''s''/''n'' ≠ 1/2) values of ''α''Prior and ''β''Prior less than 1 (and therefore negative (''α''Prior − 1) and (''β''Prior − 1)) favor sparsity, i.e. distributions where the parameter ''p'' is closer to either 0 or 1. In effect, values of ''α''Prior and ''β''Prior between 0 and 1, when operating together, function as a [[concentration parameter]]. The accompanying plots show the posterior probability density functions for sample sizes ''n'' ∈ {3,10,50}, successes ''s'' ∈ {''n''/2,''n''/4} and Beta(''α''Prior,''β''Prior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. Also shown are the cases for ''n'' = {4,12,40}, success ''s'' = {''n''/4} and Beta(''α''Prior,''β''Prior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. The first plot shows the symmetric cases, for successes ''s'' ∈ {n/2}, with mean = mode = 1/2 and the second plot shows the skewed cases ''s'' ∈ {''n''/4}. The images show that there is little difference between the priors for the posterior with sample size of 50 (characterized by a more pronounced peak near ''p'' = 1/2). Significant differences appear for very small sample sizes (in particular for the flatter distribution for the degenerate case of sample size = 3). Therefore, the skewed cases, with successes ''s'' = {''n''/4}, show a larger effect from the choice of prior, at small sample size, than the symmetric cases. For symmetric distributions, the Bayes prior Beta(1,1) results in the most "peaky" and highest posterior distributions and the Haldane prior Beta(0,0) results in the flattest and lowest peak distribution. The Jeffreys prior Beta(1/2,1/2) lies in between them. For nearly symmetric, not too skewed distributions the effect of the priors is similar. For very small sample size (in this case for a sample size of 3) and skewed distribution (in this example for ''s'' ∈ {''n''/4}) the Haldane prior can result in a reverse-J-shaped distribution with a singularity at the left end. However, this happens only in degenerate cases (in this example ''n'' = 3 and hence ''s'' = 3/4 < 1, a degenerate value because s should be greater than unity in order for the posterior of the Haldane prior to have a mode located between the ends, and because ''s'' = 3/4 is not an integer number, hence it violates the initial assumption of a binomial distribution for the likelihood) and it is not an issue in generic cases of reasonable sample size (such that the condition 1 < ''s'' < ''n'' − 1, necessary for a mode to exist between both ends, is fulfilled). In Chapter 12 (p. 385) of his book, Jaynes<ref name=Jaynes/> asserts that the ''Haldane prior'' Beta(0,0) describes a ''prior state of knowledge of complete ignorance'', where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure, while the ''Bayes (uniform) prior Beta(1,1) applies if'' one knows that ''both binary outcomes are possible''. Jaynes states: "''interpret the Bayes–Laplace (Beta(1,1)) prior as describing not a state of complete ignorance'', but the state of knowledge in which we have observed one success and one failure...once we have seen at least one success and one failure, then we know that the experiment is a true binary one, in the sense of physical possibility." Jaynes <ref name=Jaynes/> does not specifically discuss Jeffreys prior Beta(1/2,1/2) (Jaynes discussion of "Jeffreys prior" on pp. 181, 423 and on chapter 12 of Jaynes book<ref name=Jaynes/> refers instead to the improper, un-normalized, prior "1/''p'' ''dp''" introduced by Jeffreys in the 1939 edition of his book,<ref name=Jeffreys/> seven years before he introduced what is now known as Jeffreys' invariant prior: the square root of the determinant of Fisher's information matrix. ''"1/p" is Jeffreys' (1946) invariant prior for the [[exponential distribution]], not for the Bernoulli or binomial distributions''). However, it follows from the above discussion that Jeffreys Beta(1/2,1/2) prior represents a state of knowledge in between the Haldane Beta(0,0) and Bayes Beta (1,1) prior. Similarly, [[Karl Pearson]] in his 1892 book [[The Grammar of Science]]<ref name=PearsonGrammar>{{cite book| last=Pearson|first=Karl|title=The Grammar of Science|year=1892|publisher=Walter Scott, London|url=https://books.google.com/books?id=IvdsEcFwcnsC&q=grammar+of+science&pg=PR19}}</ref><ref name=PearsnGrammar2009>{{cite book|last=Pearson|first=Karl|title=The Grammar of Science|year=2009|publisher=BiblioLife|isbn=978-1110356119}}</ref> (p. 144 of 1900 edition) maintained that the Bayes (Beta(1,1) uniform prior was not a complete ignorance prior, and that it should be used when prior information justified to "distribute our ignorance equally"". K. Pearson wrote: "Yet the only supposition that we appear to have made is this: that, knowing nothing of nature, routine and anomy (from the Greek ανομία, namely: a- "without", and nomos "law") are to be considered as equally likely to occur. Now we were not really justified in making even this assumption, for it involves a knowledge that we do not possess regarding nature. We use our ''experience'' of the constitution and action of coins in general to assert that heads and tails are equally probable, but we have no right to assert before experience that, as we know nothing of nature, routine and breach are equally probable. In our ignorance we ought to consider before experience that nature may consist of all routines, all anomies (normlessness), or a mixture of the two in any proportion whatever, and that all such are equally probable. Which of these constitutions after experience is the most probable must clearly depend on what that experience has been like." If there is sufficient [[Sample (statistics)|sampling data]], ''and the posterior probability mode is not located at one of the extremes of the domain'' (''x'' = 0 or ''x'' = 1), the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar [[posterior probability|''posterior'' probability]] densities. Otherwise, as Gelman et al.<ref name=Gelman>{{cite book|last=Gelman|first=A., Carlin, J. B., Stern, H. S., and Rubin, D. B.|title=Bayesian Data Analysis| year=2003|publisher=Chapman and Hall/CRC|isbn=978-1584883883}}</ref> (p. 65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution", or as Berger<ref name=BergerDecisionTheory/> (p. 125) points out "when different reasonable priors yield substantially different answers, can it be right to state that there ''is'' a single answer? Would it not be better to admit that there is scientific uncertainty, with the conclusion depending on prior beliefs?."
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)