==== With unknown mean and unknown variance ====

For a set of [[i.i.d.]] normally distributed data points '''X''' of size ''n'' where each individual point ''x'' follows <math display=inline>x \sim \mathcal{N}(\mu, \sigma^2)</math> with unknown mean μ and unknown [[variance]] σ<sup>2</sup>, a combined (multivariate) [[conjugate prior]] is placed over the mean and variance, consisting of a [[normal-inverse-gamma distribution]]. Logically, this originates as follows:

# From the analysis of the case with unknown mean but known variance, we see that the update equations involve [[sufficient statistic]]s computed from the data, consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
# From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data, consisting of the number of data points and the [[sum of squared deviations]].
# Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
# To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: as the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
# This suggests that we create a ''conditional prior'' of the mean on the unknown variance, with a hyperparameter specifying the mean of the [[pseudo-observation]]s associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
# This leads immediately to the [[normal-inverse-gamma distribution]], which is the product of the two distributions just defined, with [[conjugate prior]]s used (an [[inverse gamma distribution]] over the variance, and a normal distribution over the mean, ''conditional'' on the variance) and with the same four parameters just defined.
The priors are normally defined as follows:
<math display=block>\begin{align}
p(\mu\mid\sigma^2; \mu_0, n_0) &\sim \mathcal{N}(\mu_0,\sigma^2/n_0) \\
p(\sigma^2; \nu_0,\sigma_0^2) &\sim I\chi^2(\nu_0,\sigma_0^2) = IG(\nu_0/2, \nu_0\sigma_0^2/2)
\end{align}</math>
<!-- \\ & =\frac{(\sigma_0^2\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \propto \frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} -->

The update equations can be derived, and look as follows:
<math display=block>\begin{align}
\bar{x} &= \frac 1 n \sum_{i=1}^n x_i \\
\mu_0' &= \frac{n_0\mu_0 + n\bar{x}}{n_0 + n} \\
n_0' &= n_0 + n \\
\nu_0' &= \nu_0 + n \\
\nu_0'{\sigma_0^2}' &= \nu_0 \sigma_0^2 + \sum_{i=1}^n (x_i-\bar{x})^2 + \frac{n_0 n}{n_0 + n}(\mu_0 - \bar{x})^2
\end{align}</math>

The respective numbers of pseudo-observations simply have the number of actual observations added to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for <math display=inline>\nu_0'{\sigma_0^2}'</math> is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new interaction term needs to be added to account for the additional error source stemming from the deviation between the prior mean and the data mean.
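As a concrete illustration, the update equations above can be sketched in plain Python. This is not from any library; the function and variable names (<code>update_normal_inverse_gamma</code>, <code>s0sq</code>, etc.) are illustrative choices mirroring the article's notation <math display=inline>(\mu_0, n_0, \nu_0, \sigma_0^2)</math>:

```python
# Sketch of the normal-inverse-gamma conjugate update described above.
# Hyperparameter names mirror the article's notation; the function name
# is illustrative, not taken from any particular library.

def update_normal_inverse_gamma(mu0, n0, nu0, s0sq, data):
    """Return the updated hyperparameters (mu0', n0', nu0', sigma0^2')
    after observing the data points in `data`."""
    n = len(data)
    xbar = sum(data) / n
    S = sum((x - xbar) ** 2 for x in data)       # sum of squared deviations

    mu0_new = (n0 * mu0 + n * xbar) / (n0 + n)   # weighted average of means
    n0_new = n0 + n                              # pseudo-obs. for the mean
    nu0_new = nu0 + n                            # pseudo-obs. for the variance
    # nu0' * sigma0^2' gains S plus the prior-mean / data-mean interaction term
    nu_s_new = nu0 * s0sq + S + (n0 * n) / (n0 + n) * (mu0 - xbar) ** 2
    s0sq_new = nu_s_new / nu0_new
    return mu0_new, n0_new, nu0_new, s0sq_new
```

As the amount of data grows, the posterior mean hyperparameter <math display=inline>\mu_0'</math> is dominated by the sample mean <math display=inline>\bar{x}</math>, since the actual observations outweigh the pseudo-observations in the weighted average.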
{{math proof | proof =
The prior distributions are
<math display=block>\begin{align}
p(\mu\mid\sigma^2; \mu_0, n_0) &\sim \mathcal{N}(\mu_0,\sigma^2/n_0) = \frac{1}{\sqrt{2\pi\frac{\sigma^2}{n_0}}} \exp\left(-\frac{n_0}{2\sigma^2}(\mu-\mu_0)^2\right) \\
&\propto (\sigma^2)^{-1/2} \exp\left(-\frac{n_0}{2\sigma^2}(\mu-\mu_0)^2\right) \\
p(\sigma^2; \nu_0,\sigma_0^2) &\sim I\chi^2(\nu_0,\sigma_0^2) = IG(\nu_0/2, \nu_0\sigma_0^2/2) \\
&= \frac{(\sigma_0^2\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \\
&\propto (\sigma^2)^{-(1+\nu_0/2)} \exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right].
\end{align}</math>

Therefore, the joint prior is
<math display=block>\begin{align}
p(\mu,\sigma^2; \mu_0, n_0, \nu_0,\sigma_0^2) &= p(\mu\mid\sigma^2; \mu_0, n_0)\,p(\sigma^2; \nu_0,\sigma_0^2) \\
&\propto (\sigma^2)^{-(\nu_0+3)/2} \exp\left[-\frac 1 {2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu-\mu_0)^2\right)\right].
\end{align}</math>

The [[likelihood function]] from the section above with known variance is:
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\sigma^2) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n(x_i -\mu)^2\right)\right]
\end{align}</math>

Writing it in terms of variance rather than precision, we get:
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\sigma^2) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n(x_i-\bar{x})^2 + n(\bar{x} -\mu)^2\right)\right] \\
&\propto (\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \left(S + n(\bar{x} -\mu)^2\right)\right]
\end{align}</math>
where <math display=inline>S = \sum_{i=1}^n(x_i-\bar{x})^2.</math>

Therefore, the posterior is (dropping the hyperparameters as conditioning factors):
<math display=block>\begin{align}
p(\mu,\sigma^2\mid\mathbf{X}) & \propto p(\mu,\sigma^2) \, p(\mathbf{X}\mid\mu,\sigma^2) \\
& \propto (\sigma^2)^{-(\nu_0+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu-\mu_0)^2\right)\right] (\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \left(S + n(\bar{x} -\mu)^2\right)\right] \\
&= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + n_0(\mu-\mu_0)^2 + n(\bar{x} -\mu)^2\right)\right] \\
&= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2 + (n_0+n)\left(\mu-\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}\right)^2\right)\right] \\
& \propto (\sigma^2)^{-1/2} \exp\left[-\frac{n_0+n}{2\sigma^2}\left(\mu-\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}\right)^2\right] \\
& \quad\times (\sigma^2)^{-(\nu_0/2+n/2+1)} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right] \\
& = \mathcal{N}_{\mu\mid\sigma^2}\left(\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}, \frac{\sigma^2}{n_0+n}\right) \cdot {\rm IG}_{\sigma^2}\left(\frac12(\nu_0+n), \frac12\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right).
\end{align}</math>

In other words, the posterior distribution has the form of a product of a normal distribution <math display=inline>p(\mu\mid\sigma^2)</math> times an inverse gamma distribution <math display=inline>p(\sigma^2)</math>, with parameters that are the same as those given by the update equations above.
}}
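The key algebraic step in the proof is completing the square, which merges the two quadratic terms in μ into a single quadratic plus a μ-free interaction term. This identity can be checked numerically with a small Python sketch (the helper names <code>lhs</code> and <code>rhs</code> are illustrative):

```python
# Numerical check of the completing-the-square identity used in the proof:
#   n0*(mu - mu0)^2 + n*(xbar - mu)^2
#     = (n0*n/(n0+n))*(mu0 - xbar)^2
#       + (n0+n)*(mu - (n0*mu0 + n*xbar)/(n0+n))^2

def lhs(mu, mu0, n0, xbar, n):
    # The two quadratic-in-mu terms as they appear before the rearrangement.
    return n0 * (mu - mu0) ** 2 + n * (xbar - mu) ** 2

def rhs(mu, mu0, n0, xbar, n):
    # The rearranged form: an interaction term free of mu, plus one
    # quadratic centred at the posterior mean hyperparameter.
    mu_post = (n0 * mu0 + n * xbar) / (n0 + n)
    return ((n0 * n) / (n0 + n) * (mu0 - xbar) ** 2
            + (n0 + n) * (mu - mu_post) ** 2)

# The identity holds for arbitrary values of mu:
for mu in (-2.0, 0.0, 3.5):
    assert abs(lhs(mu, 1.0, 2, 4.0, 5) - rhs(mu, 1.0, 2, 4.0, 5)) < 1e-9
```

Because the interaction term does not involve μ, it can be factored out of the normal part of the posterior and absorbed into the inverse-gamma part, which is exactly how the product form at the end of the proof arises.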