== Uninformative priors ==

An ''uninformative'', ''flat'', or ''diffuse prior'' expresses vague or general information about a variable.<ref name="Zellner1971" /> The term "uninformative prior" is somewhat of a misnomer. Such a prior might also be called a ''not very informative prior'', or an ''objective prior'', i.e., one that is not subjectively elicited. Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit". The simplest and oldest rule for determining a non-informative prior is the [[principle of indifference]], which assigns equal probabilities to all possibilities. In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.

Some attempts have been made at finding [[a priori probability|a priori probabilities]], i.e., probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians", who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by [[Edwin T. Jaynes]], based mainly on the consequences of symmetries and on the principle of maximum entropy.

As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B, or C, but no other information is available about its location. In this case a ''uniform prior'' of ''p''(''A'') = ''p''(''B'') = ''p''(''C'') = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge.
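This invariance argument can be made concrete in a few lines of code. The following minimal sketch (an illustration added here, not drawn from the article's sources) checks whether a prior over the cups survives every relabelling; only the uniform assignment does:

<syntaxhighlight lang="python">
from itertools import permutations

def is_label_invariant(prior):
    """Return True if the prior is unchanged by every relabelling of the cups."""
    labels = sorted(prior)
    return all(
        {new: prior[old] for old, new in zip(labels, perm)} == prior
        for perm in permutations(labels)
    )

print(is_label_invariant({"A": 0.5, "B": 0.3, "C": 0.2}))  # False: changed by relabelling
print(is_label_invariant({"A": 1/3, "B": 1/3, "C": 1/3}))  # True: only the uniform prior survives
</syntaxhighlight>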
This prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system.<ref>{{cite book |first1=Jean-Pierre |last1=Florens |first2=Michael |last2=Mouchart |first3=Jean-Marie |last3=Rolin |chapter=Invariance Arguments in Bayesian Statistics |title=Economic Decision-Making: Games, Econometrics and Optimisation |publisher=North-Holland |year=1990 |pages=351–367 |isbn=0-444-88422-X }}</ref>

As a more contentious example, Jaynes published an argument based on the invariance of the prior under a change of parameters that suggests that the prior representing complete uncertainty about a probability should be the [[Haldane prior]] ''p''<sup>−1</sup>(1 − ''p'')<sup>−1</sup>.<ref name="Jaynes1968">{{cite journal |last=Jaynes |first=Edwin T. |author-link=Edwin T. Jaynes |title=Prior Probabilities |journal=IEEE Transactions on Systems Science and Cybernetics |volume=4 |issue=3 |pages=227–241 |date=Sep 1968 |doi=10.1109/TSSC.1968.300117 |url=http://bayes.wustl.edu/etj/articles/prior.pdf }}</ref> The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior<ref>This prior was proposed by [[J.B.S. Haldane]] in "A note on inverse probability", Mathematical Proceedings of the Cambridge Philosophical Society 28, 55–61, 1932, {{doi|10.1017/S0305004100010495}}. See also J. Haldane, "The precision of observed values of small frequencies", Biometrika, 35:297–300, 1948, {{doi|10.2307/2332350}}, {{JSTOR|2332350}}.</ref> gives by far the most weight to <math>p=0</math> and <math>p=1</math>, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed the chemical dissolve in one experiment and fail to dissolve in another, then this prior is updated to the [[uniform distribution (continuous)|uniform distribution]] on the interval [0, 1]. This is obtained by applying [[Bayes' theorem]] to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it has infinite mass). [[Harold Jeffreys]] devised a systematic way of designing uninformative priors, e.g., the [[Jeffreys prior]] ''p''<sup>−1/2</sup>(1 − ''p'')<sup>−1/2</sup> for the Bernoulli random variable.
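This updating step can be verified with a conjugate-prior calculation. Below is a minimal sketch (an added illustration, not from the article's sources) that treats the Haldane prior as the limit of a proper Beta(''ε'', ''ε'') prior; after one observed dissolution and one non-dissolution, the Beta(1 + ''ε'', 1 + ''ε'') posterior approaches the uniform distribution as ''ε'' → 0:

<syntaxhighlight lang="python">
from scipy.stats import beta

# Haldane prior ~ Beta(eps, eps) for small eps; one success (dissolved) and
# one failure (did not dissolve) update it to the Beta(1 + eps, 1 + eps) posterior.
for eps in (1.0, 0.1, 0.001):
    posterior = beta(1 + eps, 1 + eps)
    print(eps, [round(posterior.pdf(p), 3) for p in (0.25, 0.5, 0.75)])
# As eps -> 0 the posterior density tends to 1 everywhere on (0, 1),
# i.e., the uniform distribution obtained via Bayes' theorem in the text.
</syntaxhighlight>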
Priors can be constructed which are proportional to the [[Haar measure]] if the parameter space ''X'' carries a [[transformation group|natural group structure]] which leaves invariant our Bayesian state of knowledge.<ref name="Jaynes1968" /> This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the [[translation group]] on ''X'', which determines the prior probability as a constant [[improper prior]]. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (e.g., whether centimeters or inches are used, the physical results should be equal). In such a case, the scale group is the natural group structure, and the corresponding prior on ''X'' is proportional to 1/''x''. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left- and right-invariant Haar measures on the [[affine group]] are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.

Another idea, championed by [[Edwin T. Jaynes]], is to use the [[principle of maximum entropy]] (MAXENT). The motivation is that the [[Shannon entropy]] of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on ''X'', one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and unit variance is the standard [[normal distribution]]. The principle of ''[[minxent|minimum cross-entropy]]'' generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense.
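The discrete example can be reproduced numerically. The sketch below (an added illustration, not from the article's sources) maximizes the Shannon entropy over distributions on five states subject only to normalization; the optimizer recovers the uniform prior:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

n = 5  # number of states in the discrete space

def neg_entropy(p):
    # Negative Shannon entropy; clipped so log stays finite at the boundary.
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

result = minimize(
    neg_entropy,
    x0=np.full(n, 1.0 / n) + np.linspace(-0.05, 0.05, n),  # a perturbed start
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)
print(np.round(result.x, 3))  # ~[0.2 0.2 0.2 0.2 0.2]: the uniform prior
</syntaxhighlight>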
A related idea, [[reference prior]]s, was introduced by [[José-Miguel Bernardo]]. Here, the idea is to maximize the expected [[Kullback–Leibler divergence]] of the posterior distribution relative to the prior. This maximizes the expected posterior information about ''X'' when the prior density is ''p''(''x''); thus, in some sense, ''p''(''x'') is the "least informative" prior about ''X''. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. In the present case, the KL divergence between the prior and posterior distributions is given by
<math display="block"> KL = \int p(t) \int p(x\mid t) \log\frac{p(x\mid t)}{p(x)} \, dx \, dt </math>
Here, <math> t </math> is a sufficient statistic for some parameter <math> x </math>. The inner integral is the KL divergence between the posterior <math> p(x\mid t) </math> and prior <math> p(x) </math> distributions and the result is the weighted mean over all values of <math> t </math>. Splitting the logarithm into two parts, reversing the order of integrals in the second part and noting that <math> \log[p(x)] </math> does not depend on <math> t </math> yields
<math display="block"> KL = \int p(t) \int p(x\mid t) \log[p(x\mid t)] \, dx \, dt \, - \, \int \log[p(x)] \int p(t) p(x\mid t) \, dt \, dx </math>
The inner integral in the second part is the integral over <math> t </math> of the joint density <math> p(x, t) </math>. This is the marginal distribution <math> p(x) </math>, so we have
<math display="block"> KL = \int p(t) \int p(x\mid t) \log[p(x\mid t)] \, dx \, dt \, - \, \int p(x) \log[p(x)] \, dx </math>
Now we use the concept of entropy which, in the case of probability distributions, is the negative expected value of the logarithm of the probability mass or density function, or <math display="inline"> H(x) = - \int p(x) \log[p(x)] \, dx </math>. Using this in the last equation yields
<math display="block"> KL = - \int p(t) H(x\mid t) \, dt + H(x) </math>
In words, KL is the negative expected value over <math> t </math> of the entropy of <math> x </math> conditional on <math> t </math>, plus the marginal (i.e., unconditional) entropy of <math> x </math>.

In the limiting case where the sample size tends to infinity, the [[Bernstein–von Mises theorem]] states that the distribution of <math> x </math> conditional on a given observed value of <math> t </math> is normal with a variance equal to the reciprocal of the Fisher information at the 'true' value of <math> x </math>. The entropy of a normal density function is equal to half the logarithm of <math> 2 \pi e v </math>, where <math> v </math> is the variance of the distribution. In this case therefore
<math display="block"> H = \log \sqrt{\frac{2\pi e}{N I(x^*)}} </math>
where <math> N </math> is the arbitrarily large sample size (to which the Fisher information is proportional) and <math> x^* </math> is the 'true' value. Since this does not depend on <math> t </math> it can be taken out of the integral, and as this integral is over a probability space it equals one. Hence we can write the asymptotic form of KL as
<math display="block"> KL = - \log \left( \frac{1}{\sqrt{k I(x^*)}} \right) - \int p(x) \log[p(x)] \, dx </math>
where <math> k </math> is proportional to the (asymptotically large) sample size. We do not know the value of <math> x^* </math>. Indeed, the very idea goes against the philosophy of Bayesian inference, in which 'true' values of parameters are replaced by prior and posterior distributions. So we remove <math> x^* </math> by replacing it with <math> x </math> and taking the expected value of the normal entropy, which we obtain by multiplying by <math> p(x) </math> and integrating over <math> x </math>. This allows us to combine the logarithms, yielding
<math display="block"> KL = - \int p(x) \log \left[ \frac{p(x)}{\sqrt{k I(x)}} \right] \, dx </math>
This is a quasi-KL divergence ("quasi" in the sense that the square root of the Fisher information may be the kernel of an improper distribution). Due to the minus sign, we need to minimise this in order to maximise the KL divergence with which we started. The minimum value of the last equation occurs where the two distributions in the logarithm argument, improper or not, do not diverge. This in turn occurs when the prior distribution is proportional to the square root of the Fisher information of the likelihood function. Hence in the single-parameter case, reference priors and Jeffreys priors are identical, even though Jeffreys arrived at his prior by a very different rationale.
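As a concrete single-parameter check (a standard computation added here for illustration), take the Bernoulli likelihood from the earlier examples. Its Fisher information is <math> I(p) = 1/\bigl(p(1-p)\bigr) </math>, so the square-root-of-Fisher-information rule reproduces exactly the Jeffreys prior quoted above:
<math display="block"> I(p) = \operatorname{E}\left[\left(\frac{\partial}{\partial p} \log f(x\mid p)\right)^2\right] = \frac{1}{p(1-p)}, \qquad \pi(p) \propto \sqrt{I(p)} = p^{-1/2}(1-p)^{-1/2}. </math>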
Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., [[Jeffreys prior|Jeffreys' rule]]) may result in priors with problematic behavior.{{clarify|post-text=A Jeffreys prior is related to KL divergence?|date=September 2015}}

Objective prior distributions may also be derived from other principles, such as [[information theory|information]] or [[coding theory]] (see e.g., [[minimum description length]]) or [[frequentist statistics]] (so-called [[probability matching prior]]s).<ref>{{cite book |first1=Gauri Sankar |last1=Datta |first2=Rahul |last2=Mukerjee |title=Probability Matching Priors: Higher Order Asymptotics |publisher=Springer |year=2004 |isbn=978-0-387-20329-4 }}</ref> Such methods are used in [[Solomonoff's theory of inductive inference]].

Methods for constructing objective priors have recently been introduced in bioinformatics, especially for inference in cancer systems biology, where sample size is limited and a vast amount of '''prior knowledge''' is available. These methods are based on an information-theoretic criterion, such as KL divergence or the log-likelihood function, and have been applied to binary supervised learning problems<ref>{{Cite journal|title=Incorporation of Biological Pathway Knowledge in the Construction of Priors for Optimal Bayesian Classification|journal=IEEE/ACM Transactions on Computational Biology and Bioinformatics|language=en-US|doi=10.1109/TCBB.2013.143|pmid=26355519|year=2014|last1=Esfahani|first1=M. S.|last2=Dougherty|first2=E. R.|volume=11|issue=1|pages=202–18|s2cid=10096507}}</ref> and mixture model problems.<ref>{{Cite journal|last1=Boluki|first1=Shahin|last2=Esfahani|first2=Mohammad Shahrokh|last3=Qian|first3=Xiaoning|last4=Dougherty|first4=Edward R|date=December 2017|title=Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors|journal=BMC Bioinformatics|language=En|volume=18|issue=S14|pages=552|doi=10.1186/s12859-017-1893-4|issn=1471-2105|pmc=5751802|pmid=29297278 |doi-access=free }}</ref>

Philosophical problems associated with uninformative priors are connected with the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, a quantity proportional to the reciprocal of his speed. These are very different priors, but it is not clear which is to be preferred. Jaynes' [[Principle of transformation groups|method of transformation groups]] can answer this question in some situations.<ref>Jaynes (1968), p. 17; see also Jaynes (2003), chapter 12. Note that chapter 12 is not available in the online preprint but can be previewed via Google Books.</ref>

Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely, and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the '''{{visible anchor|logarithmic prior}}''', which is the uniform prior on the logarithm of the proportion. The [[Jeffreys prior]] attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion ''p'' is ''p''<sup>−1/2</sup>(1 − ''p'')<sup>−1/2</sup>, which differs from Jaynes' recommendation.

Priors based on notions of [[algorithmic probability]] are used in [[inductive inference]] as a basis for induction in very general settings.
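The sense in which the Jeffreys prior "expresses the same belief no matter which metric is used" can be checked numerically. In the sketch below (an added illustration, not from the article's sources), reparameterizing the proportion as <math> \phi = \arcsin\sqrt{p} </math> and applying the change-of-variables formula turns the Jeffreys density for ''p'' into a flat density in <math> \phi </math>:

<syntaxhighlight lang="python">
import numpy as np

# Jeffreys prior for a proportion: pi(p) ~ p^(-1/2) * (1 - p)^(-1/2).
# Under phi = arcsin(sqrt(p)), i.e., p = sin(phi)^2, the transformed density
# q(phi) = pi(p(phi)) * |dp/dphi| should come out constant.
phi = np.linspace(0.1, np.pi / 2 - 0.1, 5)
p = np.sin(phi) ** 2
jeffreys_p = p ** -0.5 * (1 - p) ** -0.5   # unnormalized Jeffreys density in p
dp_dphi = 2 * np.sin(phi) * np.cos(phi)    # Jacobian |dp/dphi|
print(np.round(jeffreys_p * dp_dphi, 6))   # [2. 2. 2. 2. 2.]: flat in phi
</syntaxhighlight>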
Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used ''routinely'', i.e., with many different data sets, it should have good [[frequentist]] properties. Normally a [[Bayesian probability|Bayesian]] would not be concerned with such issues, but they can be important in this situation. For example, one would want any [[decision theory|decision rule]] based on the posterior distribution to be [[admissible decision rule|admissible]] under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with [[hierarchical Bayes model]]s; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.