Bayesian inference
==Mathematical properties==
{{More footnotes needed|section|date=February 2012}}

===Interpretation of factor===
The factor <math display="inline">\frac{P(E \mid M)}{P(E)}</math> represents the impact of the evidence <math>E</math> on the belief in the model <math>M</math>. <math display="inline"> \frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E)</math>. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief, and the belief in the model increases. The reverse applies for a decrease in belief. If the belief does not change, <math display="inline"> \frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E)</math>. That is, the evidence is independent of the model: if the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

===Cromwell's rule===
{{Main|Cromwell's rule}}
If <math>P(M) = 0</math> then <math>P(M \mid E) = 0</math>. If <math>P(M) = 1</math> and <math>P(E) > 0</math>, then <math>P(M \mid E) = 1</math>. This can be interpreted to mean that hard convictions are insensitive to counter-evidence. The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not <math>M</math>" in place of "<math>M</math>", yielding "if <math>1 - P(M) = 0</math>, then <math>1 - P(M \mid E) = 0</math>", from which the result immediately follows.

===Asymptotic behaviour of posterior===
Consider the behaviour of a belief distribution as it is updated a large number of times with [[independent and identically distributed]] trials. For sufficiently nice prior probabilities, the [[Bernstein–von Mises theorem]] gives that in the limit of infinite trials the posterior converges to a [[Gaussian distribution]] independent of the initial prior, under conditions first outlined and rigorously proven by [[Joseph L. Doob]] in 1949, namely if the random variable in consideration has a finite [[probability space]]. More general results were obtained later by the statistician [[David A. Freedman (statistician)|David A. Freedman]], who established in two seminal research papers in 1963<ref>{{cite journal| last1=Freedman|first1=DA|title=On the asymptotic behavior of Bayes' estimates in the discrete case|journal=The Annals of Mathematical Statistics|volume=34|issue=4|date=1963|pages=1386–1403|jstor=2238346|doi=10.1214/aoms/1177703871|doi-access=free}}</ref> and 1965<ref>{{cite journal|last1=Freedman|first1=DA|title=On the asymptotic behavior of Bayes estimates in the discrete case II|journal=The Annals of Mathematical Statistics|date=1965|volume=36|issue=2|pages=454–456|jstor=2238150|doi=10.1214/aoms/1177700155|doi-access=free}}</ref> when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable [[probability space]] (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the [[Bernstein–von Mises theorem]] is not applicable: in this case there is [[almost surely]] no asymptotic convergence. Later, in the 1980s and 1990s, [[David A. Freedman (statistician)|Freedman]] and [[Persi Diaconis]] continued to work on the case of infinite countable probability spaces.<ref>{{cite journal|first2=Larry|last2=Wasserman|first1=James|last1=Robins|journal=Journal of the American Statistical Association|date=2000|title=Conditioning, likelihood, and coherence: A review of some foundational concepts|doi=10.1080/01621459.2000.10474344|volume=95|issue=452|pages=1340–1346|s2cid=120767108}}</ref> To summarise, there may be insufficient trials to suppress the effects of the initial choice of prior, and especially for large (but finite) systems the convergence might be very slow.
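This behaviour can be illustrated with a conjugate Beta–Bernoulli update, a minimal sketch in which the true success rate and both prior choices are hypothetical: with few trials, two different priors give very different posterior means, while after many trials both concentrate near the true rate.

```python
def beta_posterior(successes, failures, a, b):
    """Mean and variance of the Beta(a + successes, b + failures) posterior
    obtained by updating a Beta(a, b) prior on a Bernoulli success rate."""
    a_post, b_post = a + successes, b + failures
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var

true_rate = 0.3  # hypothetical "true" Bernoulli parameter
for n in (10, 100_000):
    s = round(true_rate * n)  # idealised i.i.d. outcome counts
    flat_mean, _ = beta_posterior(s, n - s, a=1, b=1)     # uniform prior
    biased_mean, _ = beta_posterior(s, n - s, a=50, b=5)  # strongly biased prior
    print(f"n={n}: flat prior -> {flat_mean:.3f}, biased prior -> {biased_mean:.3f}")
```

At n = 10 the biased prior still dominates the estimate, while at n = 100,000 both posterior means agree with the true rate to about three decimal places, consistent with the finite-space convergence described above.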
===Conjugate priors===
{{Main|Conjugate prior}}
In parameterized form, the prior distribution is often assumed to come from a family of distributions called [[conjugate prior]]s. The usefulness of a conjugate prior is that the corresponding posterior distribution is in the same family, and the calculation may be expressed in [[Closed-form expression|closed form]].

===Estimates of parameters and predictions===
It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select [[central tendency|measures of central tendency]] from the posterior distribution. For one-dimensional continuous problems of practical interest, a unique posterior median exists, and it is attractive as a [[robust statistics|robust estimator]].<ref>{{cite book|title=Pitman's measure of closeness: A comparison of statistical estimators|first1=Pranab K.|last1=Sen|author-link1=Pranab K. Sen|first2=J. P.|last2=Keating|first3=R. L.|last3=Mason|publisher=SIAM|location=Philadelphia|year=1993}}</ref> If the posterior distribution has a finite mean, the posterior mean can be used as an estimate:<ref>{{Cite book| last1=Choudhuri|first1=Nidhan|last2=Ghosal|first2=Subhashis|last3=Roy|first3=Anindya|date=2005-01-01|chapter=Bayesian Methods for Function Estimation|title=Handbook of Statistics|series=Bayesian Thinking|volume=25|pages=373–414|doi=10.1016/s0169-7161(05)25013-7|isbn=9780444515391|citeseerx=10.1.1.324.3052}}</ref>
<math display="block">\tilde \theta = \operatorname{E}[\theta] = \int \theta \, p(\theta \mid \mathbf{X},\alpha) \, d\theta</math>
Taking the value with the greatest posterior probability defines [[maximum a posteriori estimation|maximum ''a posteriori'' (MAP)]] estimates:<ref>{{Cite web|url=https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php|title=Maximum A Posteriori (MAP) Estimation|website=www.probabilitycourse.com|language=en|access-date=2017-06-02}}</ref>
<math display="block">\{ \theta_{\text{MAP}}\} \subset \arg \max_\theta p(\theta \mid \mathbf{X},\alpha) .</math>
There are examples where no maximum is attained, in which case the set of MAP estimates is [[empty set|empty]]. There are other methods of estimation that minimize the posterior ''[[risk]]'' (expected posterior loss) with respect to a [[loss function]], and these are of interest to [[statistical decision theory]] using the sampling distribution ("frequentist statistics").<ref>{{Cite web|url=http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/bayes_dt.pdf|title=Introduction to Bayesian Decision Theory|last=Yu|first=Angela|website=cogsci.ucsd.edu/|archive-url=https://web.archive.org/web/20130228060536/http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/bayes_dt.pdf|archive-date=2013-02-28|url-status=dead}}</ref>
The [[posterior predictive distribution]] of a new observation <math>\tilde{x}</math> (that is independent of previous observations) is determined by<ref>{{Cite web|url=http://people.stat.sc.edu/Hitchcock/stat535slidesday18.pdf|title=Posterior Predictive Distribution Stat Slide|last=Hitchcock|first=David|website=stat.sc.edu}}</ref>
<math display="block">p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x},\theta \mid \mathbf{X},\alpha) \, d\theta = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X},\alpha) \, d\theta .</math>
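In the conjugate Beta–Bernoulli model these quantities all have closed forms, so the integrals above reduce to simple ratios. The sketch below (with hypothetical prior hyperparameters ''a'', ''b'') computes the posterior mean, the MAP estimate, and the posterior predictive probability of a success on the next trial.

```python
def beta_bernoulli_estimates(successes, failures, a=2.0, b=2.0):
    """Point estimates and posterior predictive under a Beta(a, b) prior.

    The posterior is Beta(a + successes, b + failures), so the posterior
    mean, the MAP estimate (the mode of the Beta density), and the
    predictive probability of a success are all simple ratios.
    """
    a_post, b_post = a + successes, b + failures
    post_mean = a_post / (a_post + b_post)            # E[theta | X]
    post_map = (a_post - 1) / (a_post + b_post - 2)   # mode; valid for a_post, b_post > 1
    pred_success = post_mean                          # p(x_new = 1 | X) = E[theta | X]
    return post_mean, post_map, pred_success

mean, mode, pred = beta_bernoulli_estimates(7, 3)  # 7 successes, 3 failures observed
print(f"posterior mean {mean:.3f}, MAP {mode:.3f}, predictive p(success) {pred:.3f}")
```

Note that the mean and the MAP estimate differ whenever the posterior is skewed; they coincide only for a symmetric posterior, which is one reason the choice of point estimate matters in practice.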