{{Short description|Method of statistical inference}}
{{Bayesian statistics}}
'''Bayesian inference''' ({{IPAc-en|ˈ|b|eɪ|z|i|ə|n}} {{respell|BAY|zee|ən}} or {{IPAc-en|ˈ|b|eɪ|ʒ|ən}} {{respell|BAY|zhən}}){{refn|{{MerriamWebsterDictionary|=2023-08-12|Bayesian}}}} is a method of [[statistical inference]] in which [[Bayes' theorem]] is used to calculate a probability of a hypothesis, given prior [[evidence]], and update it as more [[information]] becomes available. Fundamentally, Bayesian inference uses a [[Prior probability|prior distribution]] to estimate [[Posterior probability|posterior probabilities]]. Bayesian inference is an important technique in [[statistics]], and especially in [[mathematical statistics]]. Bayesian updating is particularly important in the [[Sequential analysis|dynamic analysis of a sequence of data]]. Bayesian inference has found application in a wide range of activities, including [[science]], [[engineering]], [[philosophy]], [[medicine]], [[sport]], and [[law]]. In the philosophy of [[decision theory]], Bayesian inference is closely related to subjective probability, often called "[[Bayesian probability]]".
<!--
; however, non-Bayesian updating rules are compatible with rationality, according to philosophers [[Ian Hacking]] and [[Bas van&nbsp;Fraassen]].<ref>Stanford encyclopedia of philosophy; Bayesian Epistemology; http://plato.stanford.edu/entries/epistemology-bayesian</ref><ref>Gillies, Donald (2000); "Philosophical Theories of Probability"; Routledge; Chapter 4 "The subjective theory"</ref>
-->

==Introduction to Bayes' rule==
[[File:Bayes theorem visualisation.svg|thumb|upright=1.2|A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that <math>P(A|B) P(B) = P(B|A) P(A)</math> i.e. <math>P(A|B) = \frac{P(B|A) P(A)}{P(B)}</math>. Similar reasoning can be used to show that <math>P(\neg A|B) = \frac{P(B|\neg A) P(\neg A)}{P(B)}</math> etc.]]
{{Main|Bayes' theorem}}
{{See also|Bayesian probability}}

===Formal explanation===
{| class="wikitable floatright" style="font-size:100%;"
|+ [[Contingency table]]
! {{diagonal split header|<br /><br />Evidence|Hypothesis}} !! Satisfies<br />hypothesis<br />{{mvar|H}} !! Violates<br />hypothesis<br />{{tmath|\neg H}} !! rowspan="5" style="padding:0;"| !! <br />Total
|-
! Has evidence<br />{{mvar|E}}
| |<math>P(H|E)\cdot P(E)</math><br /><math>= P(E|H)\cdot P(H)</math> || |<math>P(\neg H|E)\cdot P(E)</math><br /><math>= P(E|\neg H)\cdot P(\neg H)</math> || {{tmath|P(E)}}
|-
! No evidence<br />{{tmath|\neg E}}
| nowrap|<math>P(H|\neg E)\cdot P(\neg E)</math><br /><math>= P(\neg E|H)\cdot P(H)</math> || nowrap|<math>P(\neg H|\neg E)\cdot P(\neg E)</math><br /><math>= P(\neg E|\neg H)\cdot P(\neg H)</math> || nowrap|<math>P(\neg E)</math>=<br /><math>1-P(E)</math>
|-
| colspan="5" style="padding:0;"|
|-
! Total
| &nbsp;&nbsp; {{tmath|P(H)}} || style="text-align:right;" nowrap|<math>P(\neg H) = 1-P(H)</math> || style="text-align:center;"|1
|}
Bayesian inference derives the [[posterior probability]] as a [[consequence relation|consequence]] of two [[Antecedent (logic)|antecedent]]s: a [[prior probability]] and a "[[likelihood function]]" derived from a [[statistical model]] for the observed data. Bayesian inference computes the posterior probability according to [[Bayes' theorem]]:
<math display="block">P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)},</math>
where
* {{mvar|H}} stands for any ''hypothesis'' whose probability may be affected by [[Experimental data|data]] (called ''evidence'' below). Often there are competing hypotheses, and the task is to determine which is the most probable.
* <math>P(H)</math>, the ''[[prior probability]]'', is the estimate of the probability of the hypothesis {{mvar|H}} ''before'' the data {{mvar|E}}, the current evidence, is observed.
* {{mvar|E}}, the ''evidence'', corresponds to new data that were not used in computing the prior probability.
* <math>P(H \mid E)</math>, the ''[[posterior probability]]'', is the probability of {{mvar|H}} ''given'' {{mvar|E}}, i.e., ''after'' {{mvar|E}} is observed.  This is what we want to know: the probability of a hypothesis ''given'' the observed evidence.
* <math>P(E \mid H)</math> is the probability of observing {{mvar|E}} ''given'' {{mvar|H}} and is called the ''[[Likelihood function|likelihood]]''. As a function of {{mvar|E}} with {{mvar|H}} fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, {{mvar|E}}, while the posterior probability is a function of the hypothesis, {{mvar|H}}.
* <math>P(E)</math> is sometimes termed the [[marginal likelihood]] or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis {{mvar|H}} does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
* It is assumed that <math>P(E)>0</math>; otherwise the right-hand side is the undefined expression <math>0/0</math>.

For different values of {{mvar|H}}, only the factors <math>P(H)</math> and <math>P(E \mid H)</math>, both in the numerator, affect the value of <math>P(H \mid E)</math>{{snd}} the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

In cases where <math>\neg H</math> ("not {{mvar|H}}"), the [[logical negation]] of {{mvar|H}}, is itself a hypothesis to which a likelihood can be assigned, Bayes' rule can be rewritten as follows:
<math display="block">\begin{align}
 P(H \mid E) &= \frac{P(E \mid H) P(H)}{P(E)} \\ \\
             &= \frac{P(E \mid H) P(H)}{P(E \mid H) P(H) + P(E \mid \neg H) P(\neg H)} \\ \\
             &= \frac{1}{1 + \left(\frac{1}{P(H)} - 1\right) \frac{P(E \mid \neg H)}{P(E \mid H)} } \\
\end{align}</math>
because
<math display="block"> P(E) = P(E \mid H) P(H) + P(E \mid \neg H) P(\neg H) </math>
and
<math display="block"> P(H) + P(\neg H) = 1 .</math>
This focuses attention on the term
<math display="block"> \left(\tfrac{1}{P(H)} - 1\right) \tfrac{P(E \mid \neg H)}{P(E \mid H)} .</math>
If that term is approximately 1, then the probability of the hypothesis given the evidence, <math> P(H \mid E) </math>, is about <math>\tfrac{1}{2}</math>: the hypothesis is about as likely to be true as false. If that term is very small, close to zero, then <math> P(H \mid E) </math> is close to 1 and the hypothesis, given the evidence, is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (without consideration of evidence) is unlikely, then <math>P(H)</math> is small (but not necessarily astronomically small), <math>\tfrac{1}{P(H)}</math> is much larger than 1, and the term can be approximated as <math>\tfrac{P(E \mid \neg H)}{P(E \mid H) \cdot P(H)} </math>, so that the relevant probabilities can be compared directly to each other.

One quick and easy way to remember the equation would be to use [[Conditional probability#As an axiom of probability|rule of multiplication]]:
<math display="block">P(E \cap H) = P(E \mid H) P(H) = P(H \mid E) P(E).</math>
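The two-hypothesis form of Bayes' rule given above can be checked numerically. The following Python sketch is purely illustrative: the function names <code>posterior</code> and <code>posterior_odds_form</code> and the example probabilities are assumptions made for this example and do not come from any cited source.

```python
def posterior(prior_h, lik_h, lik_not_h):
    """Return P(H | E), expanding P(E) with the law of total probability."""
    evidence = lik_h * prior_h + lik_not_h * (1.0 - prior_h)  # P(E)
    if evidence == 0:
        raise ValueError("P(E) must be positive")
    return lik_h * prior_h / evidence


def posterior_odds_form(prior_h, lik_h, lik_not_h):
    """Same quantity via 1 / (1 + (1/P(H) - 1) * P(E | not H) / P(E | H))."""
    term = (1.0 / prior_h - 1.0) * lik_not_h / lik_h
    return 1.0 / (1.0 + term)


# Hypothetical numbers: an unlikely hypothesis and fairly diagnostic evidence.
p_direct = posterior(0.01, 0.9, 0.05)
p_odds = posterior_odds_form(0.01, 0.9, 0.05)
print(p_direct, p_odds)  # both forms agree up to rounding
```

With these assumed inputs the term discussed above equals 5.5, so the posterior is 1/6.5, well below one half even though the evidence is much more probable under the hypothesis than under its negation.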

===Alternatives to Bayesian updating===
Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

[[Ian Hacking]] noted that traditional "[[Dutch book]]" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote:<ref>{{cite journal |last=Hacking |first=Ian |date=December 1967 |page=316 |title=Slightly More Realistic Personal Probability |journal=Philosophy of Science |doi=10.1086/288169 |volume=34 |number=4 |s2cid=14344339}}</ref> "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "[[probability kinematics]]") following the publication of [[Richard C.&nbsp;Jeffrey]]'s rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.<ref>{{cite web |url=http://plato.stanford.edu/entries/bayes-theorem/ |title=Bayes' Theorem (Stanford Encyclopedia of Philosophy) |publisher=Plato.stanford.edu |access-date=2014-01-05}}</ref> The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.<ref>[[Bas van Fraassen|van Fraassen, B.]] (1989) ''Laws and Symmetry'', Oxford University Press. {{ISBN|0-19-824860-1}}.</ref>

==Inference over exclusive and exhaustive possibilities==
If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

===General formulation===
[[File:Bayesian inference event space.svg|thumb|Diagram illustrating event space <math>\Omega</math> in general formulation of Bayesian inference. Although this diagram shows discrete models and events, the continuous case may be visualized similarly using probability densities.]]

<!-- This section is not clear as it now stands. -->
Suppose a process is generating independent and identically distributed events <math>E_n,\ n = 1, 2, 3, \ldots</math>, but the [[probability distribution]] is unknown. Let the event space <math>\Omega</math> represent the current state of belief for this process. Each model is represented by event <math>M_m</math>. The conditional probabilities <math>P(E_n \mid M_m)</math> are specified to define the models. <math>P(M_m)</math> is the [[Credence (statistics)|degree of belief]] in <math>M_m</math>. Before the first inference step, <math>\{P(M_m)\}</math> is a set of ''initial prior probabilities''. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate <math>E \in \{E_n\}</math>. For each <math>M \in \{M_m\}</math>, the prior <math>P(M)</math> is updated to the posterior <math>P(M \mid E)</math>. From [[Bayes' theorem]]:<ref>Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). ''Bayesian Data Analysis'', Third Edition. Chapman and Hall/CRC. {{ISBN|978-1-4398-4095-5}}.</ref>

<math display="block">P(M \mid E) = \frac{P(E \mid M)}{\sum_m {P(E \mid M_m) P(M_m)}} \cdot P(M).</math>

Upon observation of further evidence, this procedure may be repeated.
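A single update step over a finite set of models might be sketched as follows. The function name <code>update</code> and the three coin models are hypothetical choices for illustration only.

```python
def update(priors, likelihoods):
    """One Bayesian update over exclusive and exhaustive models.

    priors[m]      is P(M_m)
    likelihoods[m] is P(E | M_m) for the event E that was observed
    """
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(E) = sum_m P(E | M_m) P(M_m)
    if evidence == 0:
        raise ValueError("the observed event has zero probability under every model")
    return [j / evidence for j in joint]


# Three hypothetical models of a coin with P(heads) = 0.25, 0.5 and 0.75.
priors = [1 / 3, 1 / 3, 1 / 3]
heads_likelihoods = [0.25, 0.5, 0.75]
posteriors = update(priors, heads_likelihoods)
print(posteriors)
```

After observing one head, belief shifts toward the model that assigns heads the highest probability, while the posteriors still sum to 1.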

===Multiple observations===

For a sequence of [[independent and identically distributed]] observations <math>\mathbf{E} = (e_1, \dots, e_n)</math>, it can be shown by induction that repeated application of the above is equivalent to
<math display="block">P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m {P(\mathbf{E} \mid M_m) P(M_m)}} \cdot P(M),</math>
where
<math display="block">P(\mathbf{E} \mid M) = \prod_k{P(e_k \mid M)}.</math>
<!-- It may be more informative if an actual example is given: e1/M, e2/M, ... might be shown as .05/4, .061/4, .033/4.  Then showing the actual calculations using these three terms in the summation. -->
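The equivalence between repeated single-observation updates and one batch update can be illustrated numerically. This is a minimal sketch under assumed inputs: the coin models, the observation sequence and all function names are hypothetical.

```python
from math import prod


def update(priors, likelihoods):
    """One Bayesian update over exclusive and exhaustive models."""
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def likelihood(e, p_heads):
    """P(e | M) for a coin model M with the given P(heads); e is 1 for heads, 0 for tails."""
    return p_heads if e == 1 else 1.0 - p_heads


models = [0.25, 0.5, 0.75]   # hypothetical models, identified by their P(heads)
observations = [1, 1, 0, 1]  # hypothetical i.i.d. data
initial = [1 / 3, 1 / 3, 1 / 3]

# Sequential: the posterior after each observation becomes the next prior.
sequential = initial
for e in observations:
    sequential = update(sequential, [likelihood(e, m) for m in models])

# Batch: a single update using the product likelihood P(E | M) = prod_k P(e_k | M).
batch = update(initial, [prod(likelihood(e, m) for e in observations) for m in models])

print(sequential)
print(batch)
```

The two results agree up to floating-point rounding, and the order of the observations does not affect the outcome.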

===Parametric formulation: motivating the formal description===

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector <math>\boldsymbol{\theta}</math> span the parameter space. Let the initial prior distribution over <math>\boldsymbol{\theta}</math> be <math>p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})</math>, where <math>\boldsymbol{\alpha}</math> is a set of parameters to the prior itself, or ''[[Hyperparameter (Bayesian statistics)|hyperparameter]]s''. Let <math>\mathbf{E} = (e_1, \dots, e_n)</math> be a sequence of [[Independent and identically distributed random variables|independent and identically distributed]] event observations, where all <math>e_i</math> are distributed as <math>p(e \mid \boldsymbol{\theta})</math> for some <math>\boldsymbol{\theta}</math>. [[Bayes' theorem]] is applied to find the [[posterior distribution]] over <math>\boldsymbol{\theta}</math>:

<math display="block">\begin{align}
 p(\boldsymbol{\theta} \mid \mathbf{E}, \boldsymbol{\alpha}) &= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{p(\mathbf{E} \mid \boldsymbol{\alpha})} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \\
  &= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{\int p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta}} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}),
\end{align}</math>
where
<math display="block">p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_k p(e_k \mid \boldsymbol{\theta}).</math>
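When the integral in the denominator has no convenient closed form, the parametric update can be approximated on a grid. The sketch below assumes a one-dimensional parameter (the success probability of a Bernoulli trial), a uniform prior and a midpoint-rule integration; these choices and all names are illustrative assumptions, not a prescribed method.

```python
N_GRID = 1000
WIDTH = 1.0 / N_GRID
THETAS = [(i + 0.5) * WIDTH for i in range(N_GRID)]  # midpoints of the grid cells


def grid_posterior(observations, prior_density):
    """Approximate p(theta | E) on the grid for i.i.d. Bernoulli observations."""
    unnormalised = []
    for theta in THETAS:
        lik = 1.0
        for e in observations:  # p(E | theta) = prod_k p(e_k | theta)
            lik *= theta if e == 1 else 1.0 - theta
        unnormalised.append(lik * prior_density(theta))
    evidence = sum(unnormalised) * WIDTH  # approximates the integral over theta
    return [u / evidence for u in unnormalised]


density = grid_posterior([1, 1, 0, 1], lambda theta: 1.0)  # uniform prior on [0, 1]
post_mean = sum(t * d for t, d in zip(THETAS, density)) * WIDTH
print(post_mean)
```

For this assumed data set (three successes, one failure) the exact posterior is a Beta(4, 2) distribution with mean 2/3, which the grid approximation reproduces closely.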

==Formal description of Bayesian inference==

===Definitions===
*<math>x</math>, a data point in general.  This may in fact be a [[random vector|vector]] of values.
*<math>\theta</math>, the [[parameter]] of the data point's distribution, i.e., {{nowrap|<math>x \sim p(x \mid \theta)</math>.}}  This may be a [[random vector|vector]] of parameters.
*<math>\alpha</math>, the [[Hyperparameter (Bayesian statistics)|hyperparameter]] of the parameter distribution, i.e., {{nowrap|<math>\theta \sim p(\theta \mid \alpha)</math>.}}  This may be a [[random vector|vector]] of hyperparameters.
*<math>\mathbf{X}</math> is the sample, a set of <math>n</math> observed data points, i.e., <math>x_1, \ldots, x_n</math>.
*<math>\tilde{x}</math>, a new data point whose distribution is to be predicted.

===Bayesian inference===

*The [[prior distribution]] is the distribution of the parameter(s) before any data is observed, i.e. <math>p(\theta \mid \alpha)</math>. The prior distribution might not be easily determined; in such a case, one possibility may be to use the [[Jeffreys prior]] to obtain a prior distribution before updating it with newer observations.
*The [[sampling distribution]] is the distribution of the observed data conditional on its parameters, i.e. {{nowrap|<math>p(\mathbf{X} \mid \theta)</math>.}}  This is also termed the [[likelihood function|likelihood]], especially when viewed as a function of the parameter(s), sometimes written <math>\operatorname{L}(\theta  \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)</math>.
*The [[marginal likelihood]] (sometimes also termed the ''evidence'') is the distribution of the observed data [[marginal distribution|marginalized]] over the parameter(s), i.e. <math display="block">p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) d\theta.</math> It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.<ref name="deCarvalho-Geometry">{{Cite journal |last1=de Carvalho|first1=Miguel| last2=Page| first2=Garritt| last3 = Barney| first3 = Bradley| title = On the geometry of Bayesian inference|journal=Bayesian Analysis|year=2019|volume=14 |issue=4 |pages=1013‒1036| doi=10.1214/18-BA1112|s2cid=88521802 |url = https://www.maths.ed.ac.uk/~mdecarv/papers/decarvalho2018.pdf}}</ref> If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
*The [[posterior distribution]] is the distribution of the parameter(s) after taking into account the observed data.  This is determined by [[Bayes' rule]], which forms the heart of Bayesian inference: <math display="block">p(\theta \mid \mathbf{X},\alpha) = \frac{p(\theta,\mathbf{X},\alpha)}{p(\mathbf{X},\alpha)} = \frac{p(\mathbf{X}\mid\theta,\alpha)p(\theta,\alpha)}{p(\mathbf{X}\mid\alpha)p(\alpha)}
= \frac{p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha).</math> This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
* In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution <math>p(\theta \mid \mathbf{X},\alpha)</math> is not obtained in a closed form distribution, mainly because the parameter space for <math>\theta</math> can be very high, or the Bayesian model retains certain hierarchical structure formulated from the observations <math>\mathbf{X}</math> and parameter <math>\theta</math>. In such situations, we need to resort to approximation techniques.<ref name="Lee-GibbsSampler">{{Cite journal |last=Lee|first=Se Yoon|  title = Gibbs sampler and coordinate ascent variational inference: A set-theoretical review|journal=Communications in Statistics – Theory and Methods|year=2021|volume=51 |issue=6 |pages=1549–1568| doi=10.1080/03610926.2021.1921214|arxiv=2008.01006|s2cid=220935477}}</ref>
* General case: Let <math>P_Y^x </math> be the conditional distribution of <math>Y</math> given <math>X = x</math> and let <math>P_X</math> be the distribution of <math>X</math>. The joint distribution is then <math>P_{X,Y} (dx,dy) = P_Y^x (dy) P_X (dx)</math>. The conditional distribution <math>P_X^y </math> of <math>X</math>  given <math>Y=y</math> is then determined by
<math display="block">P_X^y (A) = E (1_A (X) | Y = y)</math>
Existence and uniqueness of the needed [[conditional expectation]] are a consequence of the [[Radon–Nikodym theorem]]. This was formulated by [[Andrey Kolmogorov|Kolmogorov]] in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to  ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.<ref>{{Cite book |last=Kolmogorov |first=A.N. |title=Foundations of the Theory of Probability |publisher=Chelsea Publishing Company |year=1933 |orig-year=1956}}</ref> Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.<ref>{{Cite book |last=Tjur |first=Tue |url=http://archive.org/details/probabilitybased0000tjur |title=Probability based on Radon measures |date=1980 |publisher=Chichester [Eng.] ; New York : Wiley |others=Internet Archive |isbn=978-0-471-27824-5}}</ref> Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line.<ref>{{Cite journal |last1=Taraldsen |first1=Gunnar |last2=Tufto |first2=Jarle |last3=Lindqvist |first3=Bo H. |date=2021-07-24 |title=Improper priors and improper posteriors |journal=Scandinavian Journal of Statistics |language=en |volume=49 |issue=3 |pages=969–991 |doi=10.1111/sjos.12550 |issn=0303-6898 |s2cid=237736986 |doi-access=free |hdl-access=free |hdl=11250/2984409}}</ref> Modern [[Markov chain Monte Carlo]] methods have boosted the importance of Bayes' theorem including cases with improper priors.<ref>{{Cite book |last1=Robert |first1=Christian P. |url=http://worldcat.org/oclc/1159112760 |title=Monte Carlo Statistical Methods |last2=Casella |first2=George |publisher=Springer |year=2004 |isbn=978-1475741452 |oclc=1159112760}}</ref>

===Bayesian prediction===

*The [[posterior predictive distribution]] is the distribution of a new data point, marginalized over the posterior: <math display="block">p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) d\theta</math>
*The [[prior predictive distribution]] is the distribution of a new data point, marginalized over the prior: <math display="block">p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \alpha) d\theta</math>

Bayesian theory calls for the use of the posterior predictive distribution to do [[predictive inference]], i.e., to [[prediction|predict]] the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned.  Only this way is the entire posterior distribution of the parameter(s) used.  By comparison, prediction in [[frequentist statistics]] often involves finding an optimum point estimate of the parameter(s)—e.g., by [[maximum likelihood]] or [[maximum a posteriori estimation]] (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the [[variance]] of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, [[confidence interval]]s and [[prediction interval]]s in frequentist statistics when constructed from a [[normal distribution]] with unknown [[mean]] and [[variance]] are constructed using a [[Student's t-distribution]].  This correctly estimates the variance, due to the facts that (1)&nbsp;the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a [[compound probability distribution]] (as does the [[marginal likelihood]]). In fact, if the prior distribution is a [[conjugate prior]], such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the [[conjugate prior]] article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
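The difference between a posterior predictive probability and a plug-in prediction can be seen in a small conjugate case. The sketch below assumes Bernoulli data with a Beta(''a'', ''b'') prior, for which the posterior predictive probability of a success is (''a'' + successes) / (''a'' + ''b'' + ''n''); the function names and the data are hypothetical.

```python
from fractions import Fraction


def posterior_predictive_success(a, b, successes, failures):
    """P(next trial succeeds | data) under a Beta(a, b) prior and Bernoulli likelihood."""
    return Fraction(a + successes, a + b + successes + failures)


def plug_in_success(successes, failures):
    """Plug-in prediction using the maximum-likelihood point estimate of the parameter."""
    return Fraction(successes, successes + failures)


# Hypothetical data: three successes and no failures, with a uniform Beta(1, 1) prior.
bayes = posterior_predictive_success(1, 1, 3, 0)
plug_in = plug_in_success(3, 0)
print(bayes, plug_in)
```

With so little data the plug-in prediction treats a success as certain, whereas the posterior predictive probability remains below 1 because it averages over the remaining uncertainty in the parameter.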



==Mathematical properties==
{{More footnotes needed|section|date=February 2012}}

===Interpretation of factor===

<math display="inline"> \frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E)</math>. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, <math display="inline"> \frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E)</math>. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

===Cromwell's rule===

{{Main|Cromwell's rule}}

If <math>P(M) = 0</math> then <math>P(M \mid E) = 0</math>. If <math>P(M) = 1</math> and <math>P(E) > 0</math>, then <math>P(M \mid E) = 1</math>. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not <math>M</math>" in place of "<math>M</math>", yielding "if <math>1 - P(M) = 0</math>, then <math>1 - P(M \mid E) = 0</math>", from which the result immediately follows.
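The insensitivity of a zero prior to evidence can be demonstrated with a short sketch. The three models, their likelihoods and the function name <code>update</code> are assumptions chosen for illustration.

```python
def update(priors, likelihoods):
    """One Bayesian update over exclusive and exhaustive models."""
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


# The first model is given prior probability 0 even though it explains the data best.
beliefs = [0.0, 0.5, 0.5]
likelihoods = [0.99, 0.2, 0.1]  # hypothetical P(E | M_m) for a repeatedly observed event

for _ in range(20):
    beliefs = update(beliefs, likelihoods)

print(beliefs)
```

However many times the favourable evidence is observed, the first model's posterior remains exactly zero, and all belief is redistributed among the models that started with a positive prior.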

===Asymptotic behaviour of posterior===

Consider the behaviour of a belief distribution as it is updated a large number of times with [[independent and identically distributed]] trials. For sufficiently nice prior probabilities, the [[Bernstein–von Mises theorem|Bernstein-von Mises theorem]] gives that in the limit of infinite trials, the posterior converges to a [[Gaussian distribution]] independent of the initial prior under some conditions first outlined and rigorously proven by [[Joseph L. Doob]] in 1948, namely if the random variable in consideration has a finite [[probability space]]. More general results were obtained later by the statistician [[David A. Freedman (statistician)|David A. Freedman]], who established in two seminal research papers in 1963<ref>{{cite journal| last1=Freedman|first1=DA|title=On the asymptotic behavior of Bayes' estimates in the discrete case|journal=The Annals of Mathematical Statistics|volume=34|issue=4|date=1963|pages=1386–1403|jstor=2238346|doi=10.1214/aoms/1177703871|doi-access=free}}</ref> and 1965<ref>{{cite journal|last1=Freedman|first1=DA|title=On the asymptotic behavior of Bayes estimates in the discrete case II|journal=The Annals of Mathematical Statistics|date=1965|volume=36|issue=2|pages=454–456|jstor=2238150|doi=10.1214/aoms/1177700155|doi-access=free}}</ref> when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable [[probability space]] (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the [[Bernstein–von Mises theorem|Bernstein-von Mises theorem]] is not applicable. In this case there is [[almost surely]] no asymptotic convergence. Later in the 1980s and 1990s [[David A. Freedman (statistician)|Freedman]] and [[Persi Diaconis]] continued to work on the case of infinite countable probability spaces.<ref>{{cite journal|first2=Larry|last2= Wasserman |first1 = James|last1 =Robins|journal =   Journal of the American Statistical Association|date = 2000|title = Conditioning, likelihood, and coherence: A review of some foundational concepts|doi=10.1080/01621459.2000.10474344|volume=95|issue=452| pages=1340–1346|s2cid= 120767108 }}</ref> To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

===Conjugate priors===
{{Main|Conjugate prior}}

In parameterized form, the prior distribution is often assumed to come from a family of distributions called [[conjugate prior]]s. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in [[Closed-form expression|closed form]].
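As an illustration of a closed-form conjugate update, the sketch below assumes a Beta(''a'', ''b'') prior on the success probability of a binomial experiment, for which the posterior is again a Beta distribution with the observed counts added to the hyperparameters. The function name and the numbers are hypothetical.

```python
def beta_update(a, b, successes, failures):
    """Return the Beta hyperparameters after observing binomial data."""
    return a + successes, b + failures


# Hypothetical prior Beta(2, 2) and data: 7 successes, 3 failures.
a_post, b_post = beta_update(2, 2, 7, 3)
posterior_mean = a_post / (a_post + b_post)

# Updating in two stages gives the same hyperparameters as a single update.
a_two_stage, b_two_stage = beta_update(*beta_update(2, 2, 4, 1), 3, 2)

print((a_post, b_post), posterior_mean)
```

No numerical integration is needed here: the whole update reduces to adding counts to the hyperparameters.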

===Estimates of parameters and predictions===
It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select [[central tendency|measurements of central tendency]] from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a [[robust statistics|robust estimator]].<ref>{{cite book|title=Pitman's measure of closeness: A comparison of statistical estimators|first1=Pranab K.|last1=Sen|author-link1=Pranab K. Sen|first2=J. P.|last2=Keating|first3=R. L.|last3= Mason | publisher=SIAM|location=Philadelphia|year=1993}}</ref>

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.<ref>{{Cite book| last1=Choudhuri|first1=Nidhan|last2=Ghosal|first2=Subhashis|last3=Roy|first3=Anindya|date=2005-01-01|chapter=Bayesian Methods for Function Estimation|title=Handbook of Statistics|series=Bayesian Thinking|volume=25|pages=373–414|doi= 10.1016/s0169-7161(05)25013-7 |isbn=9780444515391|citeseerx=10.1.1.324.3052}}</ref>
<math display="block">\tilde \theta = \operatorname{E}[\theta] = \int \theta \, p(\theta \mid \mathbf{X},\alpha) \, d\theta</math>

Taking a value with the greatest probability defines [[maximum a posteriori estimation|maximum ''a&nbsp;posteriori'' (MAP)]] estimates:<ref>{{Cite web|url=https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php|title=Maximum A Posteriori (MAP) Estimation|website=www.probabilitycourse.com|language=en|access-date=2017-06-02}}</ref>
<math display="block">\{ \theta_{\text{MAP}}\} \subset \arg \max_\theta p(\theta \mid \mathbf{X},\alpha) .</math>

There are examples where no maximum is attained, in which case the set of MAP estimates is [[empty set|empty]].

There are other methods of estimation that minimize the posterior ''[[risk]]'' (expected-posterior loss) with respect to a [[loss function]], and these are of interest to [[statistical decision theory]] using the sampling distribution ("frequentist statistics").<ref>{{Cite web|url=http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/bayes_dt.pdf|title=Introduction to Bayesian Decision Theory|last=Yu|first=Angela|website=cogsci.ucsd.edu/|archive-url=https://web.archive.org/web/20130228060536/http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/bayes_dt.pdf|archive-date=2013-02-28|url-status=dead}}</ref>

The [[posterior predictive distribution]] of a new observation <math>\tilde{x}</math> (that is independent of previous observations) is determined by<ref>{{Cite web|url=http://people.stat.sc.edu/Hitchcock/stat535slidesday18.pdf|title=Posterior Predictive Distribution Stat Slide|last=Hitchcock|first=David|website=stat.sc.edu}}</ref>
<math display="block">p(\tilde{x}|\mathbf{X},\alpha) = \int p(\tilde{x},\theta \mid \mathbf{X},\alpha) \, d\theta = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \, d\theta .</math>
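The posterior mean, median and MAP estimates described in this section can differ noticeably for a skewed posterior. The sketch below assumes, purely for illustration, a posterior proportional to ''θ''<sup>3</sup>(1 − ''θ'') on [0, 1] (a Beta(4, 2) distribution) and approximates the three estimates on a grid; all names are hypothetical.

```python
N_GRID = 2000
WIDTH = 1.0 / N_GRID
thetas = [(i + 0.5) * WIDTH for i in range(N_GRID)]  # midpoints of the grid cells

# Unnormalised posterior density, assumed here to be Beta(4, 2).
unnormalised = [t ** 3 * (1.0 - t) for t in thetas]
total = sum(unnormalised)
weights = [u / total for u in unnormalised]  # approximate probability mass per cell

# Posterior mean: the expectation of theta under the posterior.
post_mean = sum(t * w for t, w in zip(thetas, weights))

# MAP estimate: the grid point with the largest posterior density.
map_estimate = thetas[max(range(N_GRID), key=lambda i: weights[i])]

# Posterior median: the first grid point at which the cumulative mass reaches 1/2.
cumulative = 0.0
for t, w in zip(thetas, weights):
    cumulative += w
    if cumulative >= 0.5:
        post_median = t
        break

print(post_mean, post_median, map_estimate)
```

For this left-skewed example the mean (2/3) lies below the median, which in turn lies below the mode (3/4), so the choice of estimator affects the reported value.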

==Examples==

===Probability of a hypothesis===
{| class="wikitable floatright" style="font-size:100%;"
|+ [[Contingency table]]
! {{diagonal split header|<br />Cookie|Bowl}}
! #1<br />''H''<sub>1</sub> !! #2<br />''H''<sub>2</sub> !! rowspan="4" style="padding:0;"| !! <br />Total
|-
! Plain, ''E''
| '''30''' || 20 || '''50'''
|-
! Choc, ¬''E''
| 10 || 20 || 30
|-
! Total 
| 40 || 40 || 80
|-
| colspan="5"|''P''(''H''<sub>1</sub>|''E'') = 30 / 50 = 0.6
|}
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let <math>H_1</math> correspond to bowl #1, and <math>H_2</math> to bowl #2.
It is given that the bowls are identical from Fred's point of view, thus <math>P(H_1)=P(H_2)</math>, and the two must add up to 1, so both are equal to 0.5.
The event <math>E</math> is the observation of a plain cookie. From the contents of the bowls, we know that <math>P(E \mid H_1) = 30/40 = 0.75</math> and <math>P(E \mid H_2) = 20/40 = 0.5.</math> Bayes' formula then yields
<math display="block">\begin{align}
P(H_1 \mid E) &= \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1)\;+\;P(E \mid H_2)\,P(H_2)} \\
 \\
 \ & = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\
 \\
 \ & = 0.6
\end{align}</math>

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, <math>P(H_1)</math>, which was 0.5. After observing the cookie, we must revise the probability to <math>P(H_1 \mid E)</math>, which is 0.6.
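The cookie calculation above can be reproduced with exact rational arithmetic. The dictionary keys and variable names in this sketch are arbitrary labels chosen for illustration.

```python
from fractions import Fraction

# Prior: Fred is equally likely to pick either bowl.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}

# Likelihood of drawing a plain cookie from each bowl.
plain_given_bowl = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# P(E): total probability of drawing a plain cookie.
evidence = sum(plain_given_bowl[b] * prior[b] for b in prior)

# Posterior probability of each bowl given that the cookie is plain.
posterior = {b: plain_given_bowl[b] * prior[b] / evidence for b in prior}

print(evidence, posterior["bowl1"], posterior["bowl2"])
```

The result matches the contingency table: 50 of the 80 cookies are plain, and 30 of those 50 come from bowl #1.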

===Making a prediction===
[[File:Bayesian inference archaeology example.jpg|thumb|Example results for archaeology example. This simulation was generated using c=15.2.]]

An archaeologist is working at a site thought to be from the medieval period, between the 11th century and the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable <math>C</math> (century) is to be calculated, with the discrete set of events <math>\{GD,G \bar D, \bar G D, \bar G \bar D\}</math> as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

<math display="block">P(E=GD \mid C=c) = (0.01 + \frac{0.81-0.01}{16-11}(c-11))(0.5 - \frac{0.5-0.05}{16-11}(c-11))</math>
<math display="block">P(E=G \bar D \mid C=c) = (0.01 + \frac{0.81-0.01}{16-11}(c-11))(0.5 + \frac{0.5-0.05}{16-11}(c-11))</math>
<math display="block">P(E=\bar G D \mid C=c) = ((1-0.01) - \frac{0.81-0.01}{16-11}(c-11))(0.5 - \frac{0.5-0.05}{16-11}(c-11))</math>
<math display="block">P(E=\bar G \bar D \mid C=c) = ((1-0.01) - \frac{0.81-0.01}{16-11}(c-11))(0.5 + \frac{0.5-0.05}{16-11}(c-11))</math>

Assume a uniform prior of <math display="inline"> f_C(c) = 0.2</math>, and that trials are [[independent and identically distributed]]. When a new fragment of type <math>e</math> is discovered, Bayes' theorem is applied to update the degree of belief for each <math>c</math>:
<math display="block">f_C(c \mid E=e) = \frac{P(E=e \mid C=c)}{P(E=e)}f_C(c) = \frac{P(E=e \mid C=c)}{\int_{11}^{16}{P(E=e \mid C=c)f_C(c)dc}}f_C(c)</math>

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or <math>c=15.2</math>. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th or 12th century, about a 1% chance that it was inhabited during the 13th century, a 63% chance during the 14th century and a 36% chance during the 15th century. The [[Bernstein–von Mises theorem]] here guarantees asymptotic convergence to the "true" distribution because the [[probability space]] corresponding to the discrete set of events <math>\{GD,G \bar D, \bar G D, \bar G \bar D\}</math> is finite (see the section above on the asymptotic behaviour of the posterior).
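The updating procedure can be sketched numerically by discretising <math>c</math> on a grid and replacing the integral in the denominator with a sum. The code below is an illustrative reconstruction under stated assumptions, not the program that produced the graph: the grid size, the random seed and all function names are arbitrary choices, and a different sequence of random fragments will give different per-century figures.

```python
import random


def p_glazed(c):
    """Probability that a fragment is glazed, linear in the century c."""
    return 0.01 + (0.81 - 0.01) / (16 - 11) * (c - 11)


def p_decorated(c):
    """Probability that a fragment is decorated, linear in the century c."""
    return 0.5 - (0.5 - 0.05) / (16 - 11) * (c - 11)


def likelihood(e, c):
    """P(E = e | C = c) for e = (glazed, decorated), assuming independence."""
    glazed, decorated = e
    pg = p_glazed(c) if glazed else 1 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1 - p_decorated(c)
    return pg * pd


N = 500                                          # grid cells (assumed)
dc = (16 - 11) / N                               # cell width
grid = [11 + (i + 0.5) * dc for i in range(N)]   # cell midpoints


def update(density, e):
    """One application of Bayes' theorem on the grid."""
    unnorm = [likelihood(e, c) * f for c, f in zip(grid, density)]
    evidence = sum(unnorm) * dc                  # approximates P(E = e)
    return [u / evidence for u in unnorm]


def simulate(true_c, n_fragments, seed=0):
    """Draw fragments from the model at true_c and update after each one."""
    rng = random.Random(seed)
    density = [0.2] * N                          # uniform prior on [11, 16]
    for _ in range(n_fragments):
        e = (rng.random() < p_glazed(true_c),
             rng.random() < p_decorated(true_c))
        density = update(density, e)
    return density


def mass(density, lo, hi):
    """Posterior probability that lo <= c < hi."""
    return sum(f for c, f in zip(grid, density) if lo <= c < hi) * dc


density = simulate(true_c=15.2, n_fragments=50)
for lo in range(11, 16):
    print(f"c in [{lo}, {lo + 1}): {mass(density, lo, lo + 1):.3f}")
```

Because the posterior after each fragment becomes the prior for the next, the loop in <code>simulate</code> is a direct transcription of the update formula above.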

==In frequentist statistics and decision theory==

A [[statistical decision theory|decision-theoretic]] justification of the use of Bayesian inference was given by [[Abraham Wald]], who proved that every unique Bayesian procedure is [[admissible decision rule|admissible]]. Conversely, every [[admissible decision rule|admissible]] statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.<ref name="Bickel & Doksum 2001, page 32">Bickel & Doksum (2001, p. 32)</ref>

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of [[frequentist inference]] as [[parameter estimation]], [[hypothesis testing]], and computing [[confidence intervals]].<ref>{{cite journal|doi=10.1214/aoms/1177700051|author=Kiefer, J. |author2=Schwartz R. |title=Admissible Bayes Character of T<sup>2</sup>-, R<sup>2</sup>-, and Other Fully Invariant Tests for Multivariate Normal Problems|journal=Annals of Mathematical Statistics| volume=36|issue=3 |year=1965|pages=747–770|author-link=Jack Kiefer (mathematician) |doi-access=free}}</ref><ref>{{cite journal |doi= 10.1214/aoms/1177697822| author=Schwartz, R.|title=Invariant Proper Bayes Tests for Exponential Families |journal=Annals of Mathematical Statistics| volume=40 |year=1969| pages=270–283|doi-access=free}}</ref><ref>{{cite journal|doi=10.1214/aos/1176345877|author1=Hwang, J. T.  |author2=Casella, George  |name-list-style=amp |title=Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution|journal=Annals of Statistics| volume=10|issue=3 | pages=868–881|year=1982|url= http://ecommons.cornell.edu/bitstream/1813/32852/1/BU-750-M.pdf|doi-access=free}}</ref> For example:
* "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."<ref name="Bickel & Doksum 2001, page 32"/>
* "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."<ref>{{cite book|author=Lehmann, Erich| title=Testing Statistical Hypotheses|edition=Second|year=1986| author-link=Erich Leo Lehmann}} (see p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes")</ref>
*"In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."<ref>{{cite book|last=Le Cam|first= Lucien|title=Asymptotic Methods in Statistical Decision Theory|year=1986|publisher=Springer-Verlag | isbn=978-0-387-96307-5|author-link=Lucien Le Cam}} (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)</ref>
*"A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"<ref>{{cite book |last1=Cox | first1 = D. R. | last2=Hinkley | first2 = D.V. |title=Theoretical Statistics |year=1974 | publisher=Chapman and Hall |isbn=978-0-04-121537-3 |page = 432 |author-link=David R. Cox }}</ref>
*"An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."<ref>{{cite book|last1=Cox | first1 = D. R. | last2=Hinkley | first2 = D.V. | title=Theoretical Statistics|year=1974 |publisher=Chapman and Hall|isbn=978-0-04-121537-3|page = 433|author-link=David R. Cox }}</ref>

===Model selection===
{{main|Bayesian model selection}}
{{see also|Bayesian information criterion}}
Bayesian methodology also plays a role in [[model selection]], where the aim is to select, from a set of competing models, the one that most closely represents the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest [[posterior probability]] given the data is selected. The posterior probability of a model depends on the evidence, or [[marginal likelihood]], which reflects the probability that the data is generated by the model, and on the [[Prior probability|prior belief]] in the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the [[Bayes factor]]. Since Bayesian model comparison is aimed at selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule<ref>{{cite journal|first1= P.|last1= Stoica |first2 = Y.|last2 =Selen|journal = IEEE Signal Processing Magazine |date = 2004| title = A review of information criterion rules|doi=10.1109/MSP.2004.1311138|volume=21|issue=4|pages=36–47|s2cid= 17338979 }}</ref> or the MAP probability rule.<ref>{{cite journal|first1= J.|last1= Fatermans |first2 = S.|last2 =Van Aert |first3=A.J. |last3=den Dekker|journal = Ultramicroscopy |date = 2019|title = The maximum a posteriori probability rule for atom column detection from HAADF STEM images|doi=10.1016/j.ultramic.2019.02.003|volume=201|pages=81–91|pmid= 30991277 |arxiv=1902.05809| s2cid= 104419861 }}</ref>
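As a small illustration of MAP model selection, consider a hypothetical comparison (not taken from the cited sources) between two models of a coin that shows <code>k</code> heads in <code>n</code> tosses: a fair coin, and a coin whose bias has a uniform prior on [0, 1]. Both marginal likelihoods have closed forms, so the posterior model probabilities and the Bayes factor can be computed directly. All names below are illustrative.

```python
from math import comb


def evidence_fair(k, n):
    """Marginal likelihood of one particular sequence under a fair coin."""
    return 0.5 ** n


def evidence_uniform(k, n):
    """Marginal likelihood under a uniform prior on the bias:
    the integral of t^k (1-t)^(n-k) over [0, 1], which equals
    1 / ((n + 1) * C(n, k))."""
    return 1.0 / ((n + 1) * comb(n, k))


def model_posteriors(evidences, priors):
    """Posterior probability of each model given its evidence and prior."""
    joint = [e * p for e, p in zip(evidences, priors)]
    total = sum(joint)
    return [j / total for j in joint]


k, n = 9, 10                                   # assumed data: 9 heads in 10 tosses
ev = [evidence_fair(k, n), evidence_uniform(k, n)]
post = model_posteriors(ev, priors=[0.5, 0.5])

bayes_factor = ev[1] / ev[0]                   # uniform-bias model vs fair coin
map_model = max(range(len(post)), key=post.__getitem__)

print(post, bayes_factor, map_model)
```

With equal prior model probabilities, the ratio <code>post[1] / post[0]</code> equals the Bayes factor, as stated above; with unequal priors the two differ by the prior odds.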

==Probabilistic programming==
{{main|Probabilistic programming}}

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.<ref>Bessiere, P., Mazer, E., Ahuactzin, J. M., & Mekhnacha, K. (2013). Bayesian Programming (1 edition) Chapman and Hall/CRC.</ref><ref>{{cite journal|author=Daniel Roy|date=2015|title=Probabilistic Programming|website=probabilistic-programming.org|url=http://probabilistic-programming.org/wiki/Home|access-date=2020-01-02| archive-date=2016-01-10|archive-url=https://web.archive.org/web/20160110035042/http://probabilistic-programming.org/wiki/Home| url-status=dead}}</ref><ref>{{cite journal | last1 = Ghahramani | first1 = Z | year = 2015 | title = Probabilistic machine learning and artificial intelligence | url = https://www.repository.cam.ac.uk/handle/1810/248538| journal = Nature | volume = 521 | issue = 7553| pages = 452–459 | doi = 10.1038/nature14541 | pmid = 26017444 | bibcode = 2015Natur.521..452G | s2cid = 216356 }}</ref>
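The separation between model and inference can be illustrated with a toy sketch in plain Python. This is not a probabilistic programming language and does not reflect the interface of any real one: the model is an ordinary function returning an unnormalised log posterior, and a generic random-walk Metropolis routine (chosen here only for brevity) performs inference without knowing anything about the model. The data, step size and function names are all assumptions made for the example.

```python
import math
import random


def log_joint(theta, data):
    """Model: uniform prior on theta, binomial likelihood for k successes in n."""
    if not 0.0 < theta < 1.0:
        return -math.inf                       # outside the prior's support
    k, n = data
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def metropolis(log_density, data, start, n_samples, step=0.2, seed=0):
    """Generic random-walk Metropolis sampler; knows nothing about the model."""
    rng = random.Random(seed)
    current, current_lp = start, log_density(start, data)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        lp = log_density(proposal, data)
        # Accept with probability min(1, exp(lp - current_lp)).
        if rng.random() < math.exp(min(0.0, lp - current_lp)):
            current, current_lp = proposal, lp
        samples.append(current)
    return samples


data = (7, 10)                                 # assumed: 7 successes in 10 trials
samples = metropolis(log_joint, data, start=0.5, n_samples=20000)
kept = samples[1000:]                          # discard an initial burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)                          # exact posterior mean is 8/12
```

Swapping in a different model only requires a different <code>log_joint</code>; the sampler is unchanged, which is the division of labour that probabilistic programming languages automate.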

==Applications==

===Statistical data analysis===

See the separate Wikipedia entry on [[Bayesian statistics]], specifically the [[Bayesian statistics#Statistical modeling|statistical modeling]] section in that page.

===Computer applications===
Bayesian inference has applications in [[artificial intelligence]] and [[expert system]]s. Bayesian inference techniques have been a fundamental part of computerized [[pattern recognition]] techniques since the late 1950s.<ref>{{cite journal |last1=Fienberg |first1=Stephen E. |title=When did Bayesian inference become "Bayesian"? |journal=Bayesian Analysis |date=2006-03-01 |volume=1 |issue=1 |doi=10.1214/06-BA101|doi-access=free }}</ref> There is also an ever-growing connection between Bayesian methods and simulation-based [[Monte Carlo method|Monte Carlo]] techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a [[graphical model]] structure ''may'' allow for efficient simulation algorithms such as [[Gibbs sampling]] and other [[Metropolis–Hastings algorithm]] schemes.<ref>{{cite book|author=Jim Albert|year=2009|title= Bayesian Computation with R, Second edition|publisher=Springer|location=New York, Dordrecht, etc.|isbn= 978-0-387-92297-3}}</ref> Recently{{when|date=September 2018}} Bayesian inference has gained popularity among the [[phylogenetics]] community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to [[statistical classification]], Bayesian inference has been used to develop algorithms for identifying [[e-mail spam]]. Applications which make use of Bayesian inference for spam filtering include [[CRM114 (program)|CRM114]], [[DSPAM]], [[Bogofilter]], [[SpamAssassin]], [[SpamBayes]], [[Mozilla]], XEAMS, and others. Spam classification is treated in more detail in the article on the [[naïve Bayes classifier]].
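The idea behind Bayesian spam classification can be sketched with a very small naïve Bayes classifier. The code below is a teaching example under simplifying assumptions (whitespace tokenisation, add-one smoothing, a four-message invented training set) and does not describe how any of the programs named above are implemented.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (label, text) pairs, label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, text in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab


def spam_probability(text, model):
    """P(spam | words), treating words as independent given the class."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_post = {}
    for label in ("spam", "ham"):
        lp = math.log(class_counts[label] / total)          # log prior
        n_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocab:                                   # skip unseen words
                # Add-one (Laplace) smoothing avoids zero likelihoods.
                lp += math.log((word_counts[label][w] + 1)
                               / (n_words + len(vocab)))
        log_post[label] = lp
    # Normalise in a numerically stable way.
    m = max(log_post.values())
    weights = {k: math.exp(v - m) for k, v in log_post.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Invented toy training data.
model = train([
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch offer at noon"),
])

print(spam_probability("cheap money", model))
print(spam_probability("meeting at noon", model))
```

Working in log space keeps the product of many small word probabilities from underflowing on longer messages.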

[[Solomonoff's theory of inductive inference|Solomonoff's Inductive inference]] is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable [[probability distribution]]. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and [[Occam's Razor]].<ref>{{cite journal |doi= 10.3390/e13061076 |arxiv=1105.5721 |bibcode=2011Entrp..13.1076R |title=A Philosophical Treatise of Universal Induction| journal=Entropy|volume=13 |issue=6|pages=1076–1136|year=2011|last1=Rathmanner|first1=Samuel|last2=Hutter|first2=Marcus| last3=Ormerod|first3=Thomas C|s2cid=2499910 |doi-access=free }}</ref>{{rs inline|date=September 2018}} Solomonoff's universal prior probability of any prefix ''p'' of a computable sequence ''x'' is the sum of the probabilities of all programs (for a universal computer) that compute something starting with ''p''. Given some ''p'' and any computable but unknown probability distribution from which ''x'' is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of ''x'' in optimal fashion.<ref>{{Cite journal |bibcode = 2007arXiv0709.1516H |title = On Universal Prediction and Bayesian Confirmation|journal = Theoretical Computer Science |volume = 384 |issue = 2007|pages = 33–48|last1 = Hutter|first1 = Marcus|last2 = He|first2 = Yang-Hui|last3 = Ormerod|first3 = Thomas C|year = 2007|arxiv = 0709.1516|doi = 10.1016/j.tcs.2007.05.016 |s2cid = 1500830}}</ref><ref>{{Cite CiteSeerX  |last1=Gács |first1=Peter |last2=Vitányi |first2=Paul M. B. |date=2 December 2010 |title=Raymond J. Solomonoff 1926-2009 |citeseerx=10.1.1.186.8268  }}</ref>

===Bioinformatics and healthcare applications===
Bayesian inference has been applied in different [[Bioinformatics]] applications, including differential gene expression analysis.<ref name=":edgr">Robinson, Mark D & McCarthy, Davis J & Smyth, Gordon K edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.</ref> Bayesian inference is also used in a general cancer risk model, called [[Continuous Individualized Risk Index|CIRI]] (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.<ref>{{Cite web| url=https://ciri.stanford.edu/|title=CIRI|website=ciri.stanford.edu|access-date=2019-08-11}}</ref><ref>{{Cite journal| last1=Kurtz |first1=David M.|last2=Esfahani|first2=Mohammad S.|last3=Scherer|first3=Florian|last4=Soo|first4=Joanne| last5=Jin| first5=Michael C.|last6=Liu|first6=Chih Long|last7=Newman|first7=Aaron M.|last8=Dührsen|first8=Ulrich| last9=Hüttmann | first9=Andreas | date=2019-07-25|title=Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction | journal=Cell|volume=178|issue=3|pages=699–713.e19|doi=10.1016/j.cell.2019.06.011|issn=1097-4172|pmid=31280963|pmc=7380118|doi-access=free}}</ref>

===In the courtroom===
{{Main|Jurimetrics#Bayesian analysis of evidence}}
Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "[[beyond a reasonable doubt]]".<ref>Dawid, A.&nbsp;P. and Mortera,&nbsp;J. (1996) "Coherent Analysis of Forensic Identification Evidence". ''[[Journal of the Royal Statistical Society]]'', Series&nbsp;B, 58, 425–443.</ref><ref>
Foreman, L.&nbsp;A.; Smith, A.&nbsp;F.&nbsp;M., and Evett, I.&nbsp;W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". ''Journal of the Royal Statistical Society'', Series&nbsp;A, 160, 429–469.</ref><ref>Robertson, B. and Vignaux, G.&nbsp;A. (1995) ''Interpreting Evidence: Evaluating Forensic Science in the Courtroom''. John Wiley and Sons. Chichester. {{ISBN|978-0-471-96026-3}}.</ref> Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in [[Bayes' rule|odds form]], as [[betting odds]] are more widely understood than probabilities. Alternatively, a [[Gambling and information theory|logarithmic approach]], replacing multiplication with addition, might be easier for a jury to handle.

[[Image:Ebits2c.png|thumb|right|Adding up evidence]]

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.<ref>Dawid, A. P. (2001) [http://128.40.111.250/evidence/content/dawid-paper.pdf Bayes' Theorem and Weighing Evidence by Juries]. {{Webarchive|url=https://web.archive.org/web/20150701112146/http://128.40.111.250/evidence/content/dawid-paper.pdf |date=2015-07-01 }}</ref> For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
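The odds form and the logarithmic form of successive updating can be sketched as follows. The prior of 1/1000 comes from the example above; the likelihood ratios (the probability of each piece of evidence if the defendant is guilty, divided by its probability if innocent) are invented purely for illustration, and the sketch assumes the pieces of evidence are conditionally independent.

```python
import math


def update_odds(prior_odds, likelihood_ratios):
    """Multiply the odds of guilt by each likelihood ratio in turn."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr                 # posterior of one stage is the next prior
    return odds


def odds_to_probability(odds):
    return odds / (1.0 + odds)


prior_prob = 1 / 1000
prior_odds = prior_prob / (1 - prior_prob)        # 1 to 999

# Hypothetical likelihood ratios for three pieces of evidence;
# a value below 1 favours innocence.
likelihood_ratios = [100.0, 20.0, 0.5]

posterior_odds = update_odds(prior_odds, likelihood_ratios)
posterior_prob = odds_to_probability(posterior_odds)

# Logarithmic form: multiplication of odds becomes addition of log-odds.
log_odds = math.log10(prior_odds) + sum(math.log10(lr)
                                        for lr in likelihood_ratios)

print(posterior_odds, posterior_prob, log_odds)
```

With these invented numbers the posterior odds are 1000 to 999, so even evidence with a combined likelihood ratio of 1,000 only brings the probability of guilt to about one half when the prior is 1/1000.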

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence [[expert witness]] explained Bayes' theorem to the jury in ''[[Regina versus Denis John Adams|R v Adams]]''. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin<ref>Gardner-Medwin, A. (2005) "What Probability Should the Jury Address?". ''[[Significance (journal)|Significance]]'', 2 (1), March 2005.</ref> argues that the criterion on which a verdict in a criminal trial should be based is ''not'' the probability of guilt, but rather the ''probability of the evidence, given that the defendant is innocent'' (akin to a [[frequentist]] [[p-value]]). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:
: ''A'' – the known facts and testimony could have arisen if the defendant is guilty.
: ''B'' – the known facts and testimony could have arisen if the defendant is innocent.
: ''C'' – the defendant is guilty.

Gardner-Medwin argues that the jury should believe both ''A'' and not-''B'' in order to convict. ''A'' and not-''B'' implies the truth of ''C'', but the reverse is not true. It is possible that ''B'' and ''C'' are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also [[Lindley's paradox]].

===Bayesian epistemology===
[[Bayesian epistemology]] is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

[[Karl Popper]] and [[David Miller (philosopher)|David Miller]] have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences:<ref>{{cite book|first =David |last =Miller|title = Critical Rationalism|url = https://books.google.com/books?id=bh_yCgAAQBAJ|isbn = 978-0-8126-9197-9|year = 1994|publisher = Open Court|location = Chicago}}</ref> It is prone to the same [[vicious circle]] as any other [[justificationism|justificationist]] epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of [[falsifiability|falsification]], rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

===Other===
* The [[scientific method]] is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about [[hypothesis|hypotheses]] conditional on new observations or [[experiment]]s.<ref>Howson & Urbach (2005), Jaynes (2003)</ref> Bayesian inference has also been applied to treat [[stochastic scheduling]] problems with incomplete information by Cai et al. (2009).<ref name="Cai et al. 2009">{{cite journal| last1=Cai|first1=X.Q.|last2=Wu|first2=X.Y.|last3=Zhou|first3=X.|title=Stochastic scheduling subject to breakdown-repeat breakdowns with incomplete information|journal=Operations Research|date=2009|volume=57| issue=5|pages=1236–1249| doi=10.1287/opre.1080.0660}}</ref>
* [[Bayesian search theory]] is used to search for lost objects.
* [[Bayesian inference in phylogeny]]
* [[Bayesian tool for methylation analysis]]
* [[Bayesian approaches to brain function]] investigate the brain as a Bayesian mechanism.
* Bayesian inference in ecological studies<ref>{{Cite journal|last1=Ogle|first1=Kiona|last2=Tucker|first2=Colin| last3=Cable| first3=Jessica M.|date=2014-01-01|title=Beyond simple linear mixing models: process-based isotope partitioning of ecological processes|journal=Ecological Applications|language=en|volume=24|issue=1| pages=181–195|doi=10.1890/1051-0761-24.1.181 | pmid=24640543 |bibcode=2014EcoAp..24..181O |issn=1939-5582}}</ref><ref>{{Cite journal|last1=Evaristo|first1=Jaivime|last2=McDonnell|first2=Jeffrey J.| last3=Scholl|first3=Martha A.|last4=Bruijnzeel|first4=L. Adrian|last5=Chun|first5=Kwok P.|date=2016-01-01|title=Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions| journal=Hydrological Processes|volume=30|issue=18|language=en|pages=3210–3227|doi=10.1002/hyp.10841|issn=1099-1085| bibcode=2016HyPr...30.3210E|s2cid=131588159 }}</ref>
* Bayesian inference is used to estimate parameters in stochastic chemical kinetic models<ref>{{Cite journal|last1=Gupta| first1=Ankur|last2=Rawlings|first2=James B.|date=April 2014|title=Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology|journal=AIChE Journal|volume=60|issue=4|pages=1253–1268| doi=10.1002/aic.14409| issn=0001-1541|pmc=4946376|pmid=27429455| bibcode=2014AIChE..60.1253G}}</ref>
* Bayesian inference in [[econophysics]] for currency or prediction of trend changes in financial quotations<ref>{{Cite journal|last=Fornalski| first=K.W.|title=The Tadpole Bayesian Model for Detecting Trend Changes in Financial Quotations|journal=R&R Journal of Statistics and Mathematical Sciences|date=2016|volume=2|issue=1|pages=117–122|url=http://www.rroij.com/open-access/the-tadpole-bayesian-model-for-detecting-trend-changesin-financial-quotations-.pdf}}</ref><ref>{{Cite journal|last1=Schütz|first1=N.| last2=Holschneider| first2=M.|date=2011|title=Detection of trend changes in time series using Bayesian inference|journal=Physical Review E|volume=84|issue=2|page=021120| doi=10.1103/PhysRevE.84.021120|pmid=21928962| arxiv=1104.3448| bibcode=2011PhRvE..84b1120S | s2cid=11460968}}</ref>
*[[Bayesian inference in marketing]]
*[[Bayesian inference in motor learning]]
* Bayesian inference is used in [[probabilistic numerics]] to solve numerical problems

==Bayes and Bayesian inference==
The problem considered by Bayes in Proposition&nbsp;9 of his essay, "[[An Essay Towards Solving a Problem in the Doctrine of Chances]]", is the posterior distribution for the parameter ''a'' (the success rate) of the [[binomial distribution]].{{Citation needed|date=August 2010}}
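In modern notation, and assuming a uniform prior on <math>a</math> over <math>[0,1]</math>, the posterior after observing <math>k</math> successes in <math>n</math> independent trials can be written out explicitly. The symbols <math>k</math>, <math>n</math> and <math>t</math> are introduced here for illustration and are not Bayes's own notation:
<math display="block">p(a \mid k, n) = \frac{a^{k}(1-a)^{n-k}}{\int_0^1 t^{k}(1-t)^{n-k}\,dt} = \frac{(n+1)!}{k!\,(n-k)!}\,a^{k}(1-a)^{n-k}, \qquad 0 \le a \le 1 ,</math>
which is the density of a [[beta distribution]] with parameters <math>k+1</math> and <math>n-k+1</math>.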

==History==
{{Main|History of statistics#Bayesian statistics}}

The term ''Bayesian'' refers to [[Thomas Bayes]] (1701–1761), who proved that probabilistic limits could be placed on an unknown event.{{Reference needed|date=July 2022}}   However, it was [[Pierre-Simon Laplace]] (1749–1827) who introduced (as Principle VI) what is now called [[Bayes' theorem]] and used it to address problems in [[celestial mechanics]], medical statistics, [[Reliability (statistics)|reliability]], and [[jurisprudence]].<ref name="Stigler1986" /> Early Bayesian inference, which used uniform priors following Laplace's [[principle of insufficient reason]], was called "[[inverse probability]]" (because it [[Inductive reasoning|infer]]s backwards from observations to parameters, or from effects to causes<ref name=Fienberg2006/>). After the 1920s, "inverse probability" was largely supplanted  by a collection of methods that came to be called [[frequentist statistics]].<ref name=Fienberg2006/>

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to ''objective'' and ''subjective'' currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed,<ref name="Bernardo2005"/> and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of [[Markov chain Monte Carlo]] methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.<ref name="Wolpert2004"/> Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.<ref name="Bernardo2006"/> Nonetheless, Bayesian methods are widely accepted and used, for example in the field of [[machine learning]].<ref name="Bishop2007" />

==See also==
{{cmn|
* [[Bayesian approaches to brain function]]
* [[Credibility theory]]
* [[Epistemology]]
* [[Free energy principle]]
* [[Inductive probability]]
* [[Information field theory]]
* [[Principle of maximum entropy]]
* [[Probabilistic causation]]
* [[Probabilistic programming]]
}}

== References ==
=== Citations ===
{{Reflist|refs=
<ref name="Stigler1986">{{cite book |first=Stephen M. |last=Stigler |year=1986 |title=The History of Statistics |chapter-url=https://archive.org/details/historyofstatist00stig |chapter-url-access=registration |publisher=Harvard University Press |chapter=Chapter 3 |isbn=9780674403406 }}</ref>
<ref name=Fienberg2006>{{cite journal|first=Stephen E. |last=Fienberg |year=2006 |title=When did Bayesian Inference Become 'Bayesian'? |journal=Bayesian Analysis |volume=1 |issue=1 |pages=1–40 [p. 5] |doi=10.1214/06-ba101 |doi-access=free }}</ref>
<ref name=Bernardo2005>{{cite book |author-link=José-Miguel Bernardo |first=José-Miguel |last=Bernardo |year=2005 |chapter=Reference analysis |title=Handbook of statistics |volume=25 |pages=17–90 }}</ref>
<ref name="Wolpert2004">{{cite journal |last=Wolpert |first=R.&nbsp;L. |year=2004 |title=A Conversation with James O. Berger |journal=Statistical Science |volume=19 |issue=1 |pages=205–218 |doi=10.1214/088342304000000053 |mr=2082155 |citeseerx=10.1.1.71.6112 |s2cid=120094454 }}</ref>
<ref name="Bishop2007">{{cite book |last=Bishop |first=C. M. |title=Pattern Recognition and Machine Learning |publisher=Springer |year=2007 |location=New York |isbn=978-0387310732 }}</ref>
<ref name="Bernardo2006">{{cite journal |author-link=José-Miguel Bernardo |first=José M. |last=Bernardo |year=2006 |url=http://www.ime.usp.br/~abe/ICOTS7/Proceedings/PDFs/InvitedPapers/3I2_BERN.pdf |title=A Bayesian mathematical statistics primer |journal=Icots-7 }}</ref>
}}

=== Sources ===
{{refbegin}}
* Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). ''Parameter Estimation and Inverse Problems'', Second Edition, Elsevier. {{ISBN|0123850487}}, {{ISBN|978-0123850485}}
* {{cite book |author1 = Bickel, Peter J.  |author2 = Doksum, Kjell A. |name-list-style=amp |title = Mathematical Statistics, Volume 1: Basic and Selected Topics |edition=Second (updated printing 2007) |year=2001 |publisher=Pearson Prentice–Hall |isbn = 978-0-13-850363-5 }}
* [[George E. P. Box|Box, G.&nbsp;E.&nbsp;P.]] and [[George Tiao|Tiao, G.&nbsp;C.]] (1973). ''Bayesian Inference in Statistical Analysis'', Wiley, {{ISBN|0-471-57428-7}}
* {{cite book |author = Edwards, Ward |chapter = Conservatism in Human Information Processing |editor = Kleinmuntz, B. |title = Formal Representation of Human Judgment |publisher=Wiley |year=1968 }}
* {{cite journal |author=Edwards, Ward | quote=Chapter: Conservatism in Human Information Processing (excerpted) |editor= Daniel Kahneman |editor-link= Daniel Kahneman |editor2=Paul Slovic |editor2-link=Paul Slovic |editor3=Amos Tversky |editor3-link=Amos Tversky |title = Judgment under uncertainty: Heuristics and biases |journal=Science |volume=185 |issue=4157 |pages=1124–1131 |year=1982 |bibcode = 1974Sci...185.1124T |doi = 10.1126/science.185.4157.1124 |pmid = 17835457 | s2cid=143452957 }}
* [[Edwin Thompson Jaynes|Jaynes E.&nbsp;T.]] (2003) ''Probability Theory: The Logic of Science'', CUP. {{ISBN|978-0-521-59271-0}} ([http://www-biba.inrialpes.fr/Jaynes/prob.html Link to Fragmentary Edition of March 1996]).
* {{cite book |title=Scientific Reasoning: the Bayesian Approach| author=Howson, C. |author2=Urbach, P. |name-list-style=amp |publisher=[[Open Court Publishing Company]] |year=2005 |edition=3rd |isbn=978-0-8126-9578-6 | author-link=Colin Howson }}
* {{cite book |last1=Phillips |first1=L. D.|last2=Edwards |first2=Ward |chapter=Chapter 6: Conservatism in a Simple Probability Inference Task (''Journal of Experimental Psychology'' (1966) 72: 346-354) |title = A Science of Decision Making:The Legacy of Ward Edwards| editor=Jie W. Weiss |editor2=David J. Weiss |isbn=978-0-19-532298-9 |page=536 |date=October 2008 |publisher = Oxford University Press }}
{{refend}}

==Further reading==

* For a full report on the history of Bayesian statistics and the debates with frequentist approaches, read {{cite book |last=Vallverdu |first=Jordi |title=Bayesians Versus Frequentists A Philosophical Debate on Statistical Reasoning |publisher=Springer |year=2016 |location=New York |isbn=978-3-662-48638-2 }}
* {{Cite book |last=Clayton |first=Aubrey |author-link=Aubrey Clayton|url=https://cup.columbia.edu/book/bernoullis-fallacy/9780231199940 |title=Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science |date=August 2021 |publisher=Columbia University Press |isbn=978-0-231-55335-3}}

===Elementary===
The following books are listed in ascending order of probabilistic sophistication:
* Stone, JV (2013), "Bayes' Rule: A Tutorial Introduction to Bayesian Analysis",   [http://jim-stone.staff.shef.ac.uk/BookBayes2012/BayesRuleBookMain.html Download first  chapter here], Sebtel Press, England.
* {{Cite book| title=Understanding Uncertainty, Revised Edition| author=Dennis V. Lindley | publisher=John Wiley | year=2013| edition=2nd | isbn=978-1-118-65012-7| author-link=Dennis V. Lindley }}
* {{Cite book| title=Scientific Reasoning: The Bayesian Approach| author=Colin Howson |author2=Peter Urbach |name-list-style=amp | publisher=[[Open Court Publishing Company]]| year=2005| edition=3rd | isbn=978-0-8126-9578-6| author-link=Colin Howson }}
* {{Cite book|author=Berry, Donald A.|title=Statistics: A Bayesian Perspective|publisher=Duxbury| year=1996|isbn=978-0-534-23476-8}}
*{{Cite book|author=Morris H. DeGroot|author2=Mark J. Schervish|name-list-style=amp|title=Probability and Statistics|edition=third|isbn=978-0-201-52488-8|publisher=Addison-Wesley|year=2002|url-access=registration|url=https://archive.org/details/probabilitystati00degr_0|author-link=Morris H. DeGroot}}
* Bolstad, William M. (2007) ''Introduction to Bayesian Statistics'': Second Edition, John Wiley {{ISBN|0-471-27020-2}}
*{{cite book |author=Winkler, Robert L |title=Introduction to Bayesian Inference and Decision |publisher=Probabilistic |year=2003 |isbn=978-0-9647938-4-2 |edition=2nd  }} Updated classic textbook. Bayesian theory clearly presented.
* Lee, Peter M. ''Bayesian Statistics: An Introduction''. Fourth Edition (2012), John Wiley {{ISBN|978-1-1183-3257-3}}
* {{Cite book| title = Bayesian Methods for Data Analysis, Third Edition | location = Boca Raton, FL | publisher = Chapman and Hall/CRC | year = 2008 | isbn = 978-1-58488-697-6|author1=Carlin, Bradley P.  |author2=Louis, Thomas A. |name-list-style=amp }}
* {{Cite book| title = Bayesian Data Analysis, Third Edition | publisher = Chapman and Hall/CRC | year = 2013 | isbn = 978-1-4398-4095-5|last1=Gelman|first1=Andrew|author-link1=Andrew Gelman|last2=Carlin|first2=John B.|last3=Stern|first3=Hal S.|last4=Dunson|first4=David B.|last5=Vehtari|first5=Aki|last6=Rubin|first6=Donald B.|author-link6=Donald Rubin}}

===Intermediate or advanced===
* {{Cite book|author=Berger, James O| title=Statistical Decision Theory and Bayesian Analysis| edition=Second|year=1985| publisher=Springer-Verlag|series=Springer Series in Statistics|isbn=978-0-387-96098-2| bibcode=1985sdtb.book.....B| author-link=James Berger (statistician)}}
*{{Cite book|title=Bayesian Theory|publisher=Wiley|year=1994|author-link1=José-Miguel Bernardo|last1=Bernardo|first1=José&nbsp;M.|author-link2=Adrian Smith (statistician)|last2=Smith|first2=Adrian&nbsp;F.&nbsp;M.}}
* [[Morris H. DeGroot|DeGroot, Morris H.]], ''Optimal Statistical Decisions''. Wiley Classics Library. 2004. (Originally published (1970) by McGraw-Hill.) {{ISBN|0-471-68029-X}}.
* {{cite book|title=Theory of statistics|first=Mark J.|last=Schervish|publisher=Springer-Verlag|year=1995|isbn=978-0-387-94546-0}}
* Jaynes, E. T. (1998). [http://www-biba.inrialpes.fr/Jaynes/prob.html ''Probability Theory: The Logic of Science''].
* O'Hagan, A. and Forster, J. (2003). ''Kendall's Advanced Theory of Statistics'', Volume 2B: ''Bayesian Inference''. Arnold, New York. {{ISBN|0-340-52922-9}}.
* {{Cite book|author=Robert, Christian P|title=The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation|publisher=Springer|year=2007|edition=paperback|isbn=978-0-387-71598-8}}
* [[Judea Pearl|Pearl, Judea]]. (1988). ''Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference'', San Mateo, CA: Morgan Kaufmann.
* Pierre Bessière et al. (2013). "[http://www.crcpress.com/product/isbn/9781439880326 Bayesian Programming]". CRC Press. {{ISBN|9781439880326}}
* Francisco J. Samaniego (2010). "A Comparison of the Bayesian and Frequentist Approaches to Estimation". Springer. New York, {{ISBN|978-1-4419-5940-9}}

==External links==
* {{springer|title=Bayesian approach to statistical problems|id=p/b015390}}
* [http://www.scholarpedia.org/article/Bayesian_statistics Bayesian Statistics] from Scholarpedia.
* [http://www.dcs.qmw.ac.uk/%7Enorman/BBNs/BBNs.htm Introduction to Bayesian probability] from Queen Mary University of London
* [http://webuser.bus.umich.edu/plenk/downloads.htm Mathematical Notes on Bayesian Statistics and Markov Chain Monte Carlo]
* [http://cocosci.berkeley.edu/tom/bayes.html Bayesian reading list] {{Webarchive|url=https://web.archive.org/web/20110625052506/http://cocosci.berkeley.edu/tom/bayes.html |date=2011-06-25 }}, categorized and annotated by [https://web.archive.org/web/20060711151352/http://psychology.berkeley.edu/faculty/profiles/tgriffiths.html Tom Griffiths]
* A. Hajek and S. Hartmann: [https://web.archive.org/web/20110728055439/http://stephanhartmann.org/HajekHartmann_BayesEpist.pdf Bayesian Epistemology], in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.
* S. Hartmann and J. Sprenger: [https://web.archive.org/web/20110728055519/http://stephanhartmann.org/HartmannSprenger_BayesEpis.pdf Bayesian Epistemology], in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.
* [http://plato.stanford.edu/entries/logic-inductive/ ''Stanford Encyclopedia of Philosophy'': "Inductive Logic"]
*[https://web.archive.org/web/20150905093734/http://faculty-staff.ou.edu/H/James.A.Hawthorne-1/Hawthorne--Bayesian_Confirmation_Theory.pdf Bayesian Confirmation Theory] (PDF)
* [http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html What is Bayesian Learning?]
* [https://causascientia.org/math_stat/DataUnkInf.html ''Data, Uncertainty and Inference'']: an informal introduction with many examples; the ebook (PDF) is freely available at [https://causascientia.org causaScientia]

{{Statistics|inference}}

{{Authority control}}

{{DEFAULTSORT:Bayesian Inference}}
[[Category:Bayesian inference| ]]
[[Category:Logic and statistics]]
[[Category:Statistical forecasting]]
[[Category:Probabilistic arguments]]