==Formal description of Bayesian inference==

===Definitions===
*<math>x</math>, a data point in general. This may in fact be a [[random vector|vector]] of values.
*<math>\theta</math>, the [[parameter]] of the data point's distribution, i.e., {{nowrap|<math>x \sim p(x \mid \theta)</math>.}} This may be a [[random vector|vector]] of parameters.
*<math>\alpha</math>, the [[Hyperparameter (Bayesian statistics)|hyperparameter]] of the parameter distribution, i.e., {{nowrap|<math>\theta \sim p(\theta \mid \alpha)</math>.}} This may be a [[random vector|vector]] of hyperparameters.
*<math>\mathbf{X}</math> is the sample, a set of <math>n</math> observed data points, i.e., <math>x_1, \ldots, x_n</math>.
*<math>\tilde{x}</math>, a new data point whose distribution is to be predicted.
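The roles of the hyperparameter <math>\alpha</math>, the parameter <math>\theta</math> and the sample <math>\mathbf{X}</math> can be made concrete with a small generative sketch. The following Python snippet is purely illustrative: it assumes a Beta prior and Bernoulli observations, which are not implied by the definitions above, and any other pair of distributions would fit the same template.

<syntaxhighlight lang="python">
# Illustrative sketch of the hierarchy alpha -> theta -> x (assumed model:
# p(theta | alpha) = Beta(alpha_1, alpha_2), p(x | theta) = Bernoulli(theta)).
import numpy as np

rng = np.random.default_rng(0)

alpha = (2.0, 2.0)                     # hyperparameters of the prior
theta = rng.beta(alpha[0], alpha[1])   # parameter drawn from p(theta | alpha)
X = rng.binomial(1, theta, size=20)    # sample of n = 20 data points from p(x | theta)
x_tilde = rng.binomial(1, theta)       # a further data point whose distribution is to be predicted
</syntaxhighlight>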
===Bayesian inference===
*The [[prior distribution]] is the distribution of the parameter(s) before any data is observed, i.e. <math>p(\theta \mid \alpha)</math>. The prior distribution might not be easily determined; in such a case, one possibility may be to use the [[Jeffreys prior]] to obtain a prior distribution before updating it with newer observations.
*The [[sampling distribution]] is the distribution of the observed data conditional on its parameters, i.e. {{nowrap|<math>p(\mathbf{X} \mid \theta)</math>.}} This is also termed the [[likelihood function|likelihood]], especially when viewed as a function of the parameter(s), sometimes written <math>\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)</math>.
*The [[marginal likelihood]] (sometimes also termed the ''evidence'') is the distribution of the observed data [[marginal distribution|marginalized]] over the parameter(s), i.e. <math display="block">p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) d\theta.</math> It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.<ref name="deCarvalho-Geometry">{{Cite journal |last1=de Carvalho |first1=Miguel |last2=Page |first2=Garritt |last3=Barney |first3=Bradley |title=On the geometry of Bayesian inference |journal=Bayesian Analysis |year=2019 |volume=14 |issue=4 |pages=1013–1036 |doi=10.1214/18-BA1112 |s2cid=88521802 |url=https://www.maths.ed.ac.uk/~mdecarv/papers/decarvalho2018.pdf}}</ref> If the marginal likelihood is 0, then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
*The [[posterior distribution]] is the distribution of the parameter(s) after taking into account the observed data. This is determined by [[Bayes' rule]], which forms the heart of Bayesian inference: <math display="block">p(\theta \mid \mathbf{X},\alpha) = \frac{p(\theta,\mathbf{X},\alpha)}{p(\mathbf{X},\alpha)} = \frac{p(\mathbf{X}\mid\theta,\alpha)p(\theta,\alpha)}{p(\mathbf{X}\mid\alpha)p(\alpha)} = \frac{p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha).</math> This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
*In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution <math>p(\theta \mid \mathbf{X},\alpha)</math> is not obtained in closed form, mainly because the parameter space for <math>\theta</math> can be very high-dimensional, or because the Bayesian model has a hierarchical structure formulated from the observations <math>\mathbf{X}</math> and the parameter <math>\theta</math>. In such situations, approximation techniques must be used.<ref name="Lee-GibbsSampler">{{Cite journal |last=Lee |first=Se Yoon |title=Gibbs sampler and coordinate ascent variational inference: A set-theoretical review |journal=Communications in Statistics – Theory and Methods |year=2021 |volume=51 |issue=6 |pages=1549–1568 |doi=10.1080/03610926.2021.1921214 |arxiv=2008.01006 |s2cid=220935477}}</ref>
*General case: Let <math>P_Y^x</math> be the conditional distribution of <math>Y</math> given <math>X = x</math> and let <math>P_X</math> be the distribution of <math>X</math>. The joint distribution is then <math>P_{X,Y} (dx,dy) = P_Y^x (dy) P_X (dx)</math>. The conditional distribution <math>P_X^y</math> of <math>X</math> given <math>Y = y</math> is then determined by <math display="block">P_X^y (A) = \operatorname{E} (1_A (X) \mid Y = y).</math> Existence and uniqueness of the needed [[conditional expectation]] is a consequence of the [[Radon–Nikodym theorem]]. This was formulated by [[Andrey Kolmogorov|Kolmogorov]] in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.<ref>{{Cite book |last=Kolmogorov |first=A.N. |title=Foundations of the Theory of Probability |publisher=Chelsea Publishing Company |year=1956 |orig-year=1933}}</ref> Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.<ref>{{Cite book |last=Tjur |first=Tue |title=Probability based on Radon measures |date=1980 |publisher=Wiley |location=Chichester; New York |others=Internet Archive |isbn=978-0-471-27824-5 |url=http://archive.org/details/probabilitybased0000tjur}}</ref> Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line.<ref>{{Cite journal |last1=Taraldsen |first1=Gunnar |last2=Tufto |first2=Jarle |last3=Lindqvist |first3=Bo H. |date=2021-07-24 |title=Improper priors and improper posteriors |journal=Scandinavian Journal of Statistics |language=en |volume=49 |issue=3 |pages=969–991 |doi=10.1111/sjos.12550 |issn=0303-6898 |s2cid=237736986 |doi-access=free |hdl-access=free |hdl=11250/2984409}}</ref> Modern [[Markov chain Monte Carlo]] methods have boosted the importance of Bayes' theorem, including cases with improper priors.<ref>{{Cite book |last1=Robert |first1=Christian P. |last2=Casella |first2=George |title=Monte Carlo Statistical Methods |publisher=Springer |year=2004 |isbn=978-1475741452 |oclc=1159112760 |url=http://worldcat.org/oclc/1159112760}}</ref>
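As a concrete illustration of "posterior is proportional to likelihood times prior", the sketch below evaluates Bayes' rule numerically on a grid for the Beta–Bernoulli model assumed in the snippet above; the particular prior, data and grid are illustrative assumptions. In this conjugate case the exact posterior, <math>\operatorname{Beta}(\alpha_1 + k, \alpha_2 + n - k)</math> with <math>k</math> the number of successes, is available for comparison, and the grid sum stands in for the integral defining the evidence <math>p(\mathbf{X} \mid \alpha)</math>.

<syntaxhighlight lang="python">
# Bayes' rule on a grid for the (assumed) Beta-Bernoulli model: posterior is
# proportional to likelihood times prior, normalized by the evidence.
import numpy as np

alpha = (2.0, 2.0)
X = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # hypothetical observations
n, k = len(X), X.sum()

theta_grid = np.linspace(0.001, 0.999, 999)   # discretized parameter space
prior = theta_grid**(alpha[0] - 1) * (1 - theta_grid)**(alpha[1] - 1)
prior /= np.trapz(prior, theta_grid)          # normalized p(theta | alpha)
likelihood = theta_grid**k * (1 - theta_grid)**(n - k)  # p(X | theta)

evidence = np.trapz(likelihood * prior, theta_grid)     # p(X | alpha)
posterior = likelihood * prior / evidence               # p(theta | X, alpha)
# For this conjugate model the exact answer is Beta(alpha[0] + k, alpha[1] + n - k).
</syntaxhighlight>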
===Bayesian prediction===
*The [[posterior predictive distribution]] is the distribution of a new data point, marginalized over the posterior: <math display="block">p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) d\theta</math>
*The [[prior predictive distribution]] is the distribution of a new data point, marginalized over the prior: <math display="block">p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \alpha) d\theta</math>

Bayesian theory calls for the use of the posterior predictive distribution to do [[predictive inference]], i.e., to [[prediction|predict]] the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used.

By comparison, prediction in [[frequentist statistics]] often involves finding an optimum point estimate of the parameter(s)—e.g., by [[maximum likelihood]] or [[maximum a posteriori estimation]] (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the [[variance]] of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, [[confidence interval]]s and [[prediction interval]]s in frequentist statistics, when constructed from a [[normal distribution]] with unknown [[mean]] and [[variance]], are constructed using a [[Student's t-distribution]]. This correctly estimates the variance, because (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a [[compound probability distribution]] (as does the [[marginal likelihood]]). In fact, if the prior distribution is a [[conjugate prior]], such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the [[conjugate prior]] article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
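For the conjugate Beta–Bernoulli model assumed in the sketches above, the posterior predictive probability of a new observation has a simple closed form, which can be contrasted with the plug-in approach based on a point estimate; the numbers below are illustrative only.

<syntaxhighlight lang="python">
# Posterior predictive vs. plug-in (MAP) prediction for the assumed
# Beta-Bernoulli model; all numbers are illustrative.
alpha = (2.0, 2.0)
n, k = 10, 7                      # hypothetical sample size and success count

# Posterior is Beta(a_post, b_post); the posterior predictive probability that
# a new data point equals 1 is the posterior mean of theta, i.e. the integral
# of theta * p(theta | X, alpha) over theta.
a_post, b_post = alpha[0] + k, alpha[1] + (n - k)
predictive_prob = a_post / (a_post + b_post)       # 0.642857...

# Plug-in alternative: estimate theta by its posterior mode (MAP) and use it
# directly, ignoring the remaining uncertainty about theta.
theta_map = (a_post - 1) / (a_post + b_post - 2)   # 0.666666...
</syntaxhighlight>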