===Bayesian inference===
*The [[prior distribution]] is the distribution of the parameter(s) before any data is observed, i.e. <math>p(\theta \mid \alpha)</math>. The prior distribution might not be easily determined; in such a case, one possibility may be to use the [[Jeffreys prior]] to obtain a prior distribution before updating it with newer observations.
*The [[sampling distribution]] is the distribution of the observed data conditional on its parameters, i.e. {{nowrap|<math>p(\mathbf{X} \mid \theta)</math>.}} This is also termed the [[likelihood function|likelihood]], especially when viewed as a function of the parameter(s), sometimes written <math>\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)</math>.
*The [[marginal likelihood]] (sometimes also termed the ''evidence'') is the distribution of the observed data [[marginal distribution|marginalized]] over the parameter(s), i.e. <math display="block">p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) \, d\theta.</math> It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.<ref name="deCarvalho-Geometry">{{Cite journal |last1=de Carvalho |first1=Miguel |last2=Page |first2=Garritt |last3=Barney |first3=Bradley |title=On the geometry of Bayesian inference |journal=Bayesian Analysis |year=2019 |volume=14 |issue=4 |pages=1013–1036 |doi=10.1214/18-BA1112 |s2cid=88521802 |url=https://www.maths.ed.ac.uk/~mdecarv/papers/decarvalho2018.pdf}}</ref> If the marginal likelihood is 0, then there is no agreement between the data and expert opinion, and Bayes' rule cannot be applied.
*The [[posterior distribution]] is the distribution of the parameter(s) after taking into account the observed data. It is determined by [[Bayes' rule]], which forms the heart of Bayesian inference: <math display="block">p(\theta \mid \mathbf{X},\alpha) = \frac{p(\theta,\mathbf{X},\alpha)}{p(\mathbf{X},\alpha)} = \frac{p(\mathbf{X}\mid\theta,\alpha)\,p(\theta,\alpha)}{p(\mathbf{X}\mid\alpha)\,p(\alpha)} = \frac{p(\mathbf{X} \mid \theta,\alpha)\, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta,\alpha)\, p(\theta \mid \alpha).</math> This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
*In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution <math>p(\theta \mid \mathbf{X},\alpha)</math> is not obtained in closed form, mainly because the parameter space for <math>\theta</math> can be very high-dimensional, or because the Bayesian model retains a hierarchical structure formulated from the observations <math>\mathbf{X}</math> and the parameter <math>\theta</math>. In such situations, one must resort to approximation techniques,<ref name="Lee-GibbsSampler">{{Cite journal |last=Lee |first=Se Yoon |title=Gibbs sampler and coordinate ascent variational inference: A set-theoretical review |journal=Communications in Statistics – Theory and Methods |year=2021 |volume=51 |issue=6 |pages=1549–1568 |doi=10.1080/03610926.2021.1921214 |arxiv=2008.01006 |s2cid=220935477}}</ref> such as the sampling sketch given after the general case below.
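The four quantities above can be made concrete with a minimal numerical sketch. The beta-binomial coin model here (a Beta(2, 2) prior on the coin's bias <math>\theta</math>, 7 heads observed in 10 flips) is purely illustrative and not drawn from the cited sources; a grid approximation stands in for the integral defining the marginal likelihood:

<syntaxhighlight lang="python">
import numpy as np

# Grid over the parameter theta (the coin's bias).
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]

# Prior p(theta | alpha): a Beta(2, 2) density, normalized on the grid.
prior = theta * (1.0 - theta)
prior /= prior.sum() * dtheta

# Sampling distribution / likelihood p(X | theta) for 7 heads in 10 flips.
# The binomial coefficient is omitted: it rescales the evidence but
# cancels in the posterior.
heads, flips = 7, 10
likelihood = theta**heads * (1.0 - theta)**(flips - heads)

# Marginal likelihood (evidence) p(X | alpha): theta integrated out.
evidence = (likelihood * prior).sum() * dtheta

# Posterior p(theta | X, alpha) via Bayes' rule.
posterior = likelihood * prior / evidence

# Conjugacy gives the exact posterior Beta(2 + 7, 2 + 3), whose mean is 9/14.
print("posterior mean:", (theta * posterior).sum() * dtheta)  # ~0.643
</syntaxhighlight>

Because the evidence enters only as a normalizing constant, methods that work with ratios of posterior densities can avoid computing it entirely, which is what sampling-based approximations exploit.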
*General case: Let <math>P_Y^x</math> be the conditional distribution of <math>Y</math> given <math>X = x</math> and let <math>P_X</math> be the distribution of <math>X</math>. The joint distribution is then <math>P_{X,Y} (dx,dy) = P_Y^x (dy) P_X (dx)</math>. The conditional distribution <math>P_X^y</math> of <math>X</math> given <math>Y=y</math> is then determined by <math display="block">P_X^y (A) = E (1_A (X) \mid Y = y).</math> Existence and uniqueness of the needed [[conditional expectation]] is a consequence of the [[Radon–Nikodym theorem]]. This was formulated by [[Andrey Kolmogorov|Kolmogorov]] in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.<ref>{{Cite book |last=Kolmogorov |first=A.N. |title=Foundations of the Theory of Probability |publisher=Chelsea Publishing Company |year=1956 |orig-year=1933}}</ref> Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.<ref>{{Cite book |last=Tjur |first=Tue |url=http://archive.org/details/probabilitybased0000tjur |title=Probability Based on Radon Measures |date=1980 |publisher=Wiley |location=Chichester |isbn=978-0-471-27824-5}}</ref> Bayes' theorem can be generalized to include improper prior distributions, such as the uniform distribution on the real line.<ref>{{Cite journal |last1=Taraldsen |first1=Gunnar |last2=Tufto |first2=Jarle |last3=Lindqvist |first3=Bo H. |date=2021-07-24 |title=Improper priors and improper posteriors |journal=Scandinavian Journal of Statistics |language=en |volume=49 |issue=3 |pages=969–991 |doi=10.1111/sjos.12550 |issn=0303-6898 |s2cid=237736986 |doi-access=free |hdl-access=free |hdl=11250/2984409}}</ref> Modern [[Markov chain Monte Carlo]] methods have boosted the importance of Bayes' theorem, including in cases with improper priors.<ref>{{Cite book |last1=Robert |first1=Christian P. |last2=Casella |first2=George |title=Monte Carlo Statistical Methods |publisher=Springer |year=2004 |isbn=978-1475741452 |oclc=1159112760}}</ref>
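As a companion to the Markov chain Monte Carlo remark above, the following is a minimal random-walk Metropolis sketch, one standard MCMC method; the model, proposal scale, and sample counts are illustrative assumptions, not taken from the cited sources. It targets the same beta-binomial posterior as the grid sketch above, and it never evaluates the evidence <math>p(\mathbf{X} \mid \alpha)</math>, since that constant cancels in the acceptance ratio:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
heads, flips = 7, 10

def log_posterior(theta):
    """Log of the unnormalized posterior: Beta(2, 2) prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -np.inf  # the prior puts no mass outside (0, 1)
    return (heads + 1) * np.log(theta) + (flips - heads + 1) * np.log(1.0 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)  # symmetric random-walk proposal
    # Metropolis acceptance: the evidence p(X | alpha) cancels in this ratio.
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

# Discard burn-in; the exact posterior mean is 9/14, roughly 0.643.
print("posterior mean:", np.mean(samples[2_000:]))
</syntaxhighlight>

The proposal scale of 0.1 is an arbitrary illustrative choice; in practice it trades off acceptance rate against how quickly the chain explores the posterior, and convergence would be monitored rather than assumed.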