== Statistical inference ==

=== Estimation of parameters ===
{{See also|Maximum likelihood#Continuous distribution, continuous parameter space|Gaussian function#Estimation of parameters}}

It is often the case that we do not know the parameters of the normal distribution, but instead want to [[Estimation theory|estimate]] them. That is, having a sample <math display=inline>(x_1, \ldots, x_n)</math> from a normal <math display=inline>\mathcal{N}(\mu, \sigma^2)</math> population we would like to learn the approximate values of parameters {{tmath|\mu}} and <math display=inline>\sigma^2</math>. The standard approach to this problem is the [[maximum likelihood]] method, which requires maximization of the ''[[log-likelihood function]]'':{{anchor|Log-likelihood}}
<math display=block>
 \ln\mathcal{L}(\mu,\sigma^2)
 = \sum_{i=1}^n \ln f(x_i\mid\mu,\sigma^2)
 = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.
</math>
Taking derivatives with respect to {{tmath|\mu}} and <math display=inline>\sigma^2</math> and solving the resulting system of first order conditions yields the ''maximum likelihood estimates'':
<math display=block>
 \hat{\mu} = \overline{x} \equiv \frac{1}{n}\sum_{i=1}^n x_i, \qquad
 \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2.
</math>
Then <math display=inline>\ln\mathcal{L}(\hat{\mu},\hat{\sigma}^2)</math> is as follows:
<math display=block>\ln\mathcal{L}(\hat{\mu},\hat{\sigma}^2) = (-n/2) [\ln(2 \pi \hat{\sigma}^2)+1]</math>

==== Sample mean ====
{{See also|Standard error of the mean}}

Estimator <math style="vertical-align:-.3em">\textstyle\hat\mu</math> is called the ''[[sample mean]]'', since it is the arithmetic mean of all observations. The statistic <math style="vertical-align:0">\textstyle\overline{x}</math> is [[complete statistic|complete]] and [[sufficient statistic|sufficient]] for {{tmath|\mu}}, and therefore by the [[Lehmann–Scheffé theorem]], <math style="vertical-align:-.3em">\textstyle\hat\mu</math> is the [[uniformly minimum variance unbiased]] (UMVU) estimator.<ref name="Krishnamoorthy">{{harvtxt |Krishnamoorthy |2006 |p=127 }}</ref> In finite samples it is distributed normally:
<math display=block>
 \hat\mu \sim \mathcal{N}(\mu,\sigma^2/n).
</math>
The variance of this estimator is equal to the ''μμ''-element of the inverse [[Fisher information matrix]] <math style="vertical-align:0">\textstyle\mathcal{I}^{-1}</math>. This implies that the estimator is [[efficient estimator|finite-sample efficient]]. Of practical importance is the fact that the [[standard error]] of <math style="vertical-align:-.3em">\textstyle\hat\mu</math> is proportional to <math style="vertical-align:-.3em">\textstyle1/\sqrt{n}</math>, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls and the number of trials in [[Monte Carlo simulation]]s.

From the standpoint of the [[asymptotic theory (statistics)|asymptotic theory]], <math style="vertical-align:-.3em">\textstyle\hat\mu</math> is [[consistent estimator|consistent]], that is, it [[converges in probability]] to {{tmath|\mu}} as <math display=inline>n\rightarrow\infty</math>. The estimator is also [[asymptotic normality|asymptotically normal]], which is a simple corollary of the fact that it is normal in finite samples:
<math display=block>
 \sqrt{n}(\hat\mu-\mu) \,\xrightarrow{d}\, \mathcal{N}(0,\sigma^2).
</math>
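The following is an illustrative sketch (not a prescribed procedure) of the maximum likelihood estimates above and of the <math display=inline>1/\sqrt{n}</math> behaviour of the standard error of the sample mean; it assumes [[NumPy]] is available, and the true parameter values, sample sizes and random seed are arbitrary:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0                        # true parameters (arbitrary illustrative values)
x = rng.normal(mu, sigma, size=1000)

# Maximum likelihood estimates
mu_hat = x.mean()                           # sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()     # (1/n) * sum of squared deviations
print(mu_hat, sigma2_hat)

# Standard error of the sample mean is proportional to 1/sqrt(n):
# multiplying n by 100 divides sigma/sqrt(n) by 10.
for n in (100, 10_000):
    print(n, sigma / np.sqrt(n))
</syntaxhighlight>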
==== Sample variance ====
{{See also|Standard deviation#Estimation|Variance#Estimation}}

The estimator <math style="vertical-align:0">\textstyle\hat\sigma^2</math> is called the ''[[sample variance]]'', since it is the variance of the sample (<math display=inline>(x_1, \ldots, x_n)</math>). In practice, another estimator is often used instead of <math style="vertical-align:0">\textstyle\hat\sigma^2</math>. This other estimator is denoted <math display=inline>s^2</math>, and is also called the ''sample variance'', which represents a certain ambiguity in terminology; its square root {{tmath|s}} is called the ''sample standard deviation''. The estimator <math display=inline>s^2</math> differs from <math style="vertical-align:0">\textstyle\hat\sigma^2</math> by having {{math|(''n'' − 1)}} instead of ''n'' in the denominator (the so-called [[Bessel's correction]]):
<math display=block>
 s^2 = \frac{n}{n-1} \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2.
</math>
The difference between <math display=inline>s^2</math> and <math style="vertical-align:0">\textstyle\hat\sigma^2</math> becomes negligibly small for large ''n''. In finite samples, however, the motivation behind the use of <math display=inline>s^2</math> is that it is an [[unbiased estimator]] of the underlying parameter <math display=inline>\sigma^2</math>, whereas <math style="vertical-align:0">\textstyle\hat\sigma^2</math> is biased. Also, by the Lehmann–Scheffé theorem the estimator <math display=inline>s^2</math> is uniformly minimum variance unbiased ([[UMVU]]),<ref name="Krishnamoorthy" /> which makes it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator <math style="vertical-align:0">\textstyle\hat\sigma^2</math> is better than <math display=inline>s^2</math> in terms of the [[mean squared error]] (MSE) criterion. In finite samples both <math display=inline>s^2</math> and <math style="vertical-align:0">\textstyle\hat\sigma^2</math> have a scaled [[chi-squared distribution]] with {{math|(''n'' − 1)}} degrees of freedom:
<math display=block>
 s^2 \sim \frac{\sigma^2}{n-1} \cdot \chi^2_{n-1}, \qquad
 \hat\sigma^2 \sim \frac{\sigma^2}{n} \cdot \chi^2_{n-1}.
</math>
The first of these expressions shows that the variance of <math display=inline>s^2</math> is equal to <math display=inline>2\sigma^4/(n-1)</math>, which is slightly greater than the ''σσ''-element of the inverse Fisher information matrix <math style="vertical-align:0">\textstyle\mathcal{I}^{-1}</math>, which is <math display=inline>2\sigma^4/n</math>. Thus, <math display=inline>s^2</math> is not an efficient estimator for <math display=inline>\sigma^2</math>, and moreover, since <math display=inline>s^2</math> is UMVU, we can conclude that the finite-sample efficient estimator for <math display=inline>\sigma^2</math> does not exist.

Applying the asymptotic theory, both estimators <math display=inline>s^2</math> and <math style="vertical-align:0">\textstyle\hat\sigma^2</math> are consistent, that is, they converge in probability to <math display=inline>\sigma^2</math> as the sample size <math display=inline>n\rightarrow\infty</math>. The two estimators are also both asymptotically normal:
<math display=block>
 \sqrt{n}(\hat\sigma^2 - \sigma^2) \simeq
 \sqrt{n}(s^2-\sigma^2) \,\xrightarrow{d}\, \mathcal{N}(0,2\sigma^4).
</math>
In particular, both estimators are asymptotically efficient for <math display=inline>\sigma^2</math>.
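A minimal sketch of the relation <math display=inline>s^2 = \tfrac{n}{n-1}\hat\sigma^2</math>, again assuming NumPy; the <code>ddof</code> argument selects the divisor, and the sample and seed are arbitrary:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=20)   # small sample, arbitrary seed

n = x.size
sigma2_hat = x.var(ddof=0)   # MLE: divides by n        (biased)
s2 = x.var(ddof=1)           # Bessel-corrected: divides by n - 1 (unbiased)

assert np.isclose(s2, n / (n - 1) * sigma2_hat)
print(sigma2_hat, s2)
</syntaxhighlight>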
=== Confidence intervals ===
{{See also|Studentization|3-sigma rule}}

By [[Cochran's theorem]], for normal distributions the sample mean <math style="vertical-align:-.3em">\textstyle\hat\mu</math> and the sample variance ''s''<sup>2</sup> are [[independence (probability theory)|independent]], which means there can be no gain in considering their [[joint distribution]]. There is also a converse theorem: if in a sample the sample mean and sample variance are independent, then the sample must have come from the normal distribution. The independence between <math style="vertical-align:-.3em">\textstyle\hat\mu</math> and ''s'' can be employed to construct the so-called ''t-statistic'':
<math display=block>
 t = \frac{\hat\mu-\mu}{s/\sqrt{n}} = \frac{\overline{x}-\mu}{\sqrt{\frac{1}{n(n-1)}\sum(x_i-\overline{x})^2}} \sim t_{n-1}
</math>
This quantity ''t'' has the [[Student's t-distribution]] with {{math|(''n'' − 1)}} degrees of freedom, and it is an [[ancillary statistic]] (independent of the value of the parameters). Inverting the distribution of this ''t''-statistic will allow us to construct the [[confidence interval]] for ''μ'';<ref>{{harvtxt |Krishnamoorthy |2006 |p=130 }}</ref> similarly, inverting the ''χ''<sup>2</sup> distribution of the statistic ''s''<sup>2</sup> will give us the confidence interval for ''σ''<sup>2</sup>:<ref>{{harvtxt |Krishnamoorthy |2006 |p=133 }}</ref>
<math display=block>\mu \in \left[ \hat\mu - t_{n-1,1-\alpha/2} \frac{s}{\sqrt{n}},\, \hat\mu + t_{n-1,1-\alpha/2} \frac{s}{\sqrt{n}} \right]</math>
<math display=block>\sigma^2 \in \left[ \frac{n-1}{\chi^2_{n-1,1-\alpha/2}}s^2,\, \frac{n-1}{\chi^2_{n-1,\alpha/2}}s^2\right]</math>
where ''t<sub>k,p</sub>'' and {{SubSup|χ|''k,p''|2}} are the ''p''th [[quantile]]s of the ''t''- and ''χ''<sup>2</sup>-distributions respectively. These confidence intervals are of the ''[[confidence level]]'' {{math|1 − ''α''}}, meaning that the true values ''μ'' and ''σ''<sup>2</sup> fall outside of these intervals with probability (or [[significance level]]) ''α''. In practice people usually take {{math|''α'' {{=}} 5%}}, resulting in the 95% confidence intervals. The confidence interval for ''σ'' can be found by taking the square root of the interval bounds for ''σ''<sup>2</sup>.

Approximate formulas can be derived from the asymptotic distributions of <math style="vertical-align:-.3em">\textstyle\hat\mu</math> and ''s''<sup>2</sup>:
<math display=block>\mu \in \left[ \hat\mu - \frac{|z_{\alpha/2}|}{\sqrt n}s,\, \hat\mu + \frac{|z_{\alpha/2}|}{\sqrt n}s \right]</math>
<math display=block>\sigma^2 \in \left[ s^2 - \sqrt{2}\frac{|z_{\alpha/2}|}{\sqrt{n}} s^2 ,\, s^2 + \sqrt{2}\frac{|z_{\alpha/2}|}{\sqrt{n}} s^2 \right]</math>
The approximate formulas become valid for large values of ''n'', and are more convenient for manual calculation since the standard normal quantiles ''z''<sub>''α''/2</sub> do not depend on ''n''. In particular, the most popular value of {{math|''α'' {{=}} 5%}} results in {{math|{{!}}''z''<sub>0.025</sub>{{!}} {{=}} [[1.96]]}}.
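A sketch of the exact intervals above using the quantile functions in [[SciPy]] (an assumed dependency); the sample and confidence level are arbitrary:
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(10.0, 3.0, size=30)   # arbitrary sample
n, alpha = x.size, 0.05

xbar = x.mean()
s2 = x.var(ddof=1)
s = np.sqrt(s2)

# Interval for mu based on the t-distribution with n - 1 degrees of freedom
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
mu_ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# Interval for sigma^2 based on the chi-squared distribution with n - 1 degrees of freedom
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)
sigma2_ci = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)

print(mu_ci, sigma2_ci)
</syntaxhighlight>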
=== Normality tests ===
{{Main|Normality tests}}

Normality tests assess the likelihood that the given data set {''x''<sub>1</sub>, ..., ''x<sub>n</sub>''} comes from a normal distribution. Typically the [[null hypothesis]] ''H''<sub>0</sub> is that the observations are distributed normally with unspecified mean ''μ'' and variance ''σ''<sup>2</sup>, versus the alternative ''H<sub>a</sub>'' that the distribution is arbitrary. Many tests (over 40) have been devised for this problem. The more prominent of them are outlined below:

'''Diagnostic plots''' are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis.
* [[Q–Q plot]], also known as [[normal probability plot]] or [[rankit]] plot, is a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. That is, it is a plot of points of the form (Φ<sup>−1</sup>(''p<sub>k</sub>''), ''x''<sub>(''k'')</sub>), where plotting points ''p<sub>k</sub>'' are equal to ''p<sub>k</sub>'' = (''k'' − ''α'')/(''n'' + 1 − 2''α'') and ''α'' is an adjustment constant, which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
* [[P–P plot]] – similar to the Q–Q plot, but used much less frequently. This method consists of plotting the points (Φ(''z''<sub>(''k'')</sub>), ''p<sub>k</sub>''), where <math display=inline>\textstyle z_{(k)} = (x_{(k)}-\hat\mu)/\hat\sigma</math>. For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).

'''Goodness-of-fit tests''':

''Moment-based tests'':
* [[D'Agostino's K-squared test]]
* [[Jarque–Bera test]]
* [[Shapiro–Wilk test]]: This is based on the fact that the line in the Q–Q plot has the slope of ''σ''. The test compares the least squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two quantities differ significantly.

''Tests based on the empirical distribution function'':
* [[Anderson–Darling test]]
* [[Lilliefors test]] (an adaptation of the [[Kolmogorov–Smirnov test]])
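As an illustration, several of the tests and the Q–Q plot coordinates listed above are available in SciPy; the following sketch assumes SciPy is installed and uses an arbitrary sample drawn under the null hypothesis:
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)   # arbitrary sample, normal under H0

# Shapiro–Wilk and Anderson–Darling tests of the normality null hypothesis
W, p_value = stats.shapiro(x)
ad = stats.anderson(x, dist='norm')
print(W, p_value, ad.statistic, ad.critical_values)

# Q–Q plot coordinates: theoretical normal quantiles vs. sorted data,
# plus the least-squares line through them
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist='norm')
print(slope, intercept, r)
</syntaxhighlight>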
=== Bayesian analysis of the normal distribution ===

Bayesian analysis of normally distributed data is complicated by the many different possibilities that may be considered:
* Either the mean, or the variance, or neither, may be considered a fixed quantity.
* When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the [[precision (statistics)|precision]], the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the analysis of most cases is simplified.
* Both univariate and [[multivariate normal distribution|multivariate]] cases need to be considered.
* Either [[conjugate prior|conjugate]] or [[improper prior|improper]] [[prior distribution]]s may be placed on the unknown variables.
* An additional set of cases occurs in [[Bayesian linear regression]], where in the basic model the data is assumed to be normally distributed, and normal priors are placed on the [[regression coefficient]]s. The resulting analysis is similar to the basic cases of [[independent identically distributed]] data.

The formulas for the non-linear-regression cases are summarized in the [[conjugate prior]] article.

==== Sum of two quadratics ====

===== Scalar form =====
The following auxiliary formula is useful for simplifying the [[posterior distribution|posterior]] update equations, which otherwise become fairly tedious.
<math display=block>a(x-y)^2 + b(x-z)^2 = (a + b)\left(x - \frac{ay+bz}{a+b}\right)^2 + \frac{ab}{a+b}(y-z)^2</math>
This equation rewrites the sum of two quadratics in ''x'' by expanding the squares, grouping the terms in ''x'', and [[completing the square]]. Note the following about the complex constant factors attached to some of the terms:
# The factor <math display=inline>\frac{ay+bz}{a+b}</math> has the form of a [[weighted average]] of ''y'' and ''z''.
# <math display=inline>\frac{ab}{a+b} = \frac{1}{\frac{1}{a}+\frac{1}{b}} = (a^{-1} + b^{-1})^{-1}.</math> This shows that this factor can be thought of as resulting from a situation where the [[Multiplicative inverse|reciprocals]] of quantities ''a'' and ''b'' add directly, so to combine ''a'' and ''b'' themselves, it is necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the [[harmonic mean]], so it is not surprising that <math display=inline>\frac{ab}{a+b}</math> is one-half the [[harmonic mean]] of ''a'' and ''b''.

===== Vector form =====
A similar formula can be written for the sum of two vector quadratics: If '''x''', '''y''', '''z''' are vectors of length ''k'', and '''A''' and '''B''' are [[symmetric matrix|symmetric]], [[invertible matrices]] of size <math display=inline>k\times k</math>, then
<math display=block>
\begin{align}
& (\mathbf{y}-\mathbf{x})'\mathbf{A}(\mathbf{y}-\mathbf{x}) + (\mathbf{x}-\mathbf{z})' \mathbf{B}(\mathbf{x}-\mathbf{z}) \\
= {} & (\mathbf{x} - \mathbf{c})'(\mathbf{A}+\mathbf{B})(\mathbf{x} - \mathbf{c}) + (\mathbf{y} - \mathbf{z})'(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}(\mathbf{y} - \mathbf{z})
\end{align}
</math>
where
<math display=block>\mathbf{c} = (\mathbf{A} + \mathbf{B})^{-1}(\mathbf{A}\mathbf{y} + \mathbf{B} \mathbf{z})</math>
The form '''x'''′ '''A''' '''x''' is called a [[quadratic form]] and is a [[scalar (mathematics)|scalar]]:
<math display=block>\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i,j}a_{ij} x_i x_j</math>
In other words, it sums up all possible combinations of products of pairs of elements from '''x''', with a separate coefficient for each. In addition, since <math display=inline>x_i x_j = x_j x_i</math>, only the sum <math display=inline>a_{ij} + a_{ji}</math> matters for any off-diagonal elements of '''A''', and there is no loss of generality in assuming that '''A''' is [[symmetric matrix|symmetric]]. Furthermore, if '''A''' is symmetric, then the form <math display=inline>\mathbf{x}'\mathbf{A}\mathbf{y} = \mathbf{y}'\mathbf{A}\mathbf{x}.</math>

==== Sum of differences from the mean ====
Another useful formula is as follows:
<math display=block>\sum_{i=1}^n (x_i-\mu)^2 = \sum_{i=1}^n (x_i-\bar{x})^2 + n(\bar{x} -\mu)^2</math>
where <math display=inline>\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.</math>
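Both auxiliary identities can be checked numerically; the following sketch (arbitrary values, NumPy assumed) verifies the scalar sum-of-two-quadratics formula and the sum-of-differences formula:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)

# Scalar identity:
#   a(x-y)^2 + b(x-z)^2 = (a+b)(x - (ay+bz)/(a+b))^2 + ab/(a+b) * (y-z)^2
a, b, x, y, z = 2.0, 3.0, 0.7, -1.2, 4.5   # arbitrary values
lhs = a * (x - y) ** 2 + b * (x - z) ** 2
rhs = (a + b) * (x - (a * y + b * z) / (a + b)) ** 2 + a * b / (a + b) * (y - z) ** 2
assert np.isclose(lhs, rhs)

# Sum of differences from the mean:
#   sum_i (x_i - mu)^2 = sum_i (x_i - xbar)^2 + n (xbar - mu)^2
data = rng.normal(size=50)
mu = 0.3
xbar = data.mean()
lhs = ((data - mu) ** 2).sum()
rhs = ((data - xbar) ** 2).sum() + data.size * (xbar - mu) ** 2
assert np.isclose(lhs, rhs)
</syntaxhighlight>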
==== With known variance ====
For a set of [[i.i.d.]] normally distributed data points '''X''' of size ''n'' where each individual point ''x'' follows <math display=inline>x \sim \mathcal{N}(\mu, \sigma^2)</math> with known [[variance]] σ<sup>2</sup>, the [[conjugate prior]] distribution is also normally distributed.

This can be shown more easily by rewriting the variance as the [[precision (statistics)|precision]], i.e. using τ = 1/σ<sup>2</sup>. Then if <math display=inline>x \sim \mathcal{N}(\mu, 1/\tau)</math> and <math display=inline>\mu \sim \mathcal{N}(\mu_0, 1/\tau_0),</math> we proceed as follows.

First, the [[likelihood function]] is (using the formula above for the sum of differences from the mean):
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\tau) &= \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{1}{2}\tau(x_i-\mu)^2\right) \\
&= \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left(-\frac{1}{2}\tau \sum_{i=1}^n (x_i-\mu)^2\right) \\
&= \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left[-\frac{1}{2}\tau \left(\sum_{i=1}^n(x_i-\bar{x})^2 + n(\bar{x} -\mu)^2\right)\right].
\end{align}</math>
Then, we proceed as follows:
<math display=block>\begin{align}
p(\mu\mid\mathbf{X}) &\propto p(\mathbf{X}\mid\mu) p(\mu) \\
& = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left[-\frac{1}{2}\tau \left(\sum_{i=1}^n(x_i-\bar{x})^2 + n(\bar{x} -\mu)^2\right)\right] \sqrt{\frac{\tau_0}{2\pi}} \exp\left(-\frac{1}{2}\tau_0(\mu-\mu_0)^2\right) \\
&\propto \exp\left(-\frac{1}{2}\left(\tau\left(\sum_{i=1}^n(x_i-\bar{x})^2 + n(\bar{x} -\mu)^2\right) + \tau_0(\mu-\mu_0)^2\right)\right) \\
&\propto \exp\left(-\frac{1}{2} \left(n\tau(\bar{x}-\mu)^2 + \tau_0(\mu-\mu_0)^2 \right)\right) \\
&= \exp\left(-\frac{1}{2}(n\tau + \tau_0)\left(\mu - \dfrac{n\tau \bar{x} + \tau_0\mu_0}{n\tau + \tau_0}\right)^2 + \frac{n\tau\tau_0}{n\tau+\tau_0}(\bar{x} - \mu_0)^2\right) \\
&\propto \exp\left(-\frac{1}{2}(n\tau + \tau_0)\left(\mu - \dfrac{n\tau \bar{x} + \tau_0\mu_0}{n\tau + \tau_0}\right)^2\right)
\end{align}</math>
In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors not involving ''μ''. The result is the [[kernel (statistics)|kernel]] of a normal distribution, with mean <math display=inline>\frac{n\tau \bar{x} + \tau_0\mu_0}{n\tau + \tau_0}</math> and precision <math display=inline>n\tau + \tau_0</math>, i.e.
<math display=block>p(\mu\mid\mathbf{X}) \sim \mathcal{N}\left(\frac{n\tau \bar{x} + \tau_0\mu_0}{n\tau + \tau_0}, \frac{1}{n\tau + \tau_0}\right)</math>
This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior parameters:
<math display=block>\begin{align}
\tau_0' &= \tau_0 + n\tau \\[5pt]
\mu_0' &= \frac{n\tau \bar{x} + \tau_0\mu_0}{n\tau + \tau_0} \\[5pt]
\bar{x} &= \frac{1}{n}\sum_{i=1}^n x_i
\end{align}</math>
That is, to combine ''n'' data points with total precision of ''nτ'' (or equivalently, total variance of ''σ''<sup>2</sup>/''n'') and mean of values <math display=inline>\bar{x}</math>, derive a new total precision simply by adding the total precision of the data to the prior total precision, and form a new mean through a ''precision-weighted average'', i.e. a [[weighted average]] of the data mean and the prior mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition, consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood, so it makes sense that we are more certain of it than of either of its components.)

The above formula reveals why it is more convenient to do [[Bayesian analysis]] of [[conjugate prior]]s for the normal distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions, and the posterior mean is computed through a precision-weighted average, as described above. The same formulas can be written in terms of variance by reciprocating all the precisions, yielding the uglier formulas
<math display=block>\begin{align}
{\sigma^2_0}' &= \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \\[5pt]
\mu_0' &= \frac{\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \\[5pt]
\bar{x} &= \frac{1}{n}\sum_{i=1}^n x_i
\end{align}</math>
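A minimal sketch of these update equations in the precision parameterization (the helper name <code>posterior_mu_known_variance</code> and all numerical values are illustrative, not standard):
<syntaxhighlight lang="python">
import numpy as np

def posterior_mu_known_variance(x, tau, mu0, tau0):
    """Conjugate update for mu when the data precision tau = 1/sigma^2 is known.

    Returns the posterior mean and posterior precision of mu."""
    n = len(x)
    xbar = np.mean(x)
    tau_post = tau0 + n * tau                                    # precisions add
    mu_post = (n * tau * xbar + tau0 * mu0) / (n * tau + tau0)   # precision-weighted average
    return mu_post, tau_post

rng = np.random.default_rng(5)
sigma = 2.0                                   # known standard deviation (arbitrary)
x = rng.normal(1.0, sigma, size=25)
mu_post, tau_post = posterior_mu_known_variance(x, 1 / sigma**2, mu0=0.0, tau0=0.5)
print(mu_post, 1 / tau_post)                  # posterior mean and posterior variance
</syntaxhighlight>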
==== With known mean ====
For a set of [[i.i.d.]] normally distributed data points '''X''' of size ''n'' where each individual point ''x'' follows <math display=inline>x \sim \mathcal{N}(\mu, \sigma^2)</math> with known mean μ, the [[conjugate prior]] of the [[variance]] has an [[inverse gamma distribution]] or a [[scaled inverse chi-squared distribution]]. The two are equivalent except for having different [[parameter]]izations. Although the inverse gamma is more commonly used, we use the scaled inverse chi-squared for the sake of convenience. The prior for σ<sup>2</sup> is as follows:
<math display=block>p(\sigma^2\mid\nu_0,\sigma_0^2) = \frac{(\sigma_0^2\frac{\nu_0}{2})^{\nu_0/2}}{\Gamma\left(\frac{\nu_0}{2} \right)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\frac{\nu_0}{2}}} \propto \frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\frac{\nu_0}{2}}}</math>
The [[likelihood function]] from above, written in terms of the variance, is:
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\sigma^2) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2\right] \\
&= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{S}{2\sigma^2}\right]
\end{align}</math>
where
<math display=block>S = \sum_{i=1}^n (x_i-\mu)^2.</math>
Then:
<math display=block>\begin{align}
p(\sigma^2\mid\mathbf{X}) &\propto p(\mathbf{X}\mid\sigma^2) p(\sigma^2) \\
&= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{S}{2\sigma^2}\right] \frac{(\sigma_0^2\frac{\nu_0}{2})^{\frac{\nu_0}{2}}}{\Gamma\left(\frac{\nu_0}{2} \right)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\frac{\nu_0}{2}}} \\
&\propto \left(\frac{1}{\sigma^2}\right)^{n/2} \frac{1}{(\sigma^2)^{1+\frac{\nu_0}{2}}} \exp\left[-\frac{S}{2\sigma^2} + \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right] \\
&= \frac{1}{(\sigma^2)^{1+\frac{\nu_0+n}{2}}} \exp\left[-\frac{\nu_0 \sigma_0^2 + S}{2\sigma^2}\right]
\end{align}</math>
The above is also a scaled inverse chi-squared distribution where
<math display=block>\begin{align}
\nu_0' &= \nu_0 + n \\
\nu_0'{\sigma_0^2}' &= \nu_0 \sigma_0^2 + \sum_{i=1}^n (x_i-\mu)^2
\end{align}</math>
or equivalently
<math display=block>\begin{align}
\nu_0' &= \nu_0 + n \\
{\sigma_0^2}' &= \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^n (x_i-\mu)^2}{\nu_0+n}
\end{align}</math>
Reparameterizing in terms of an [[inverse gamma distribution]], the result is:
<math display=block>\begin{align}
\alpha' &= \alpha + \frac{n}{2} \\
\beta' &= \beta + \frac{\sum_{i=1}^n (x_i-\mu)^2}{2}
\end{align}</math>
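A minimal sketch of this update in the scaled-inverse-chi-squared parameterization (the helper name and all numbers are illustrative only; the equivalent inverse-gamma update is noted in the comments):
<syntaxhighlight lang="python">
import numpy as np

def posterior_variance_known_mean(x, mu, nu0, sigma0_sq):
    """Conjugate update for sigma^2 when mu is known (scaled inverse chi-squared prior)."""
    S = np.sum((np.asarray(x) - mu) ** 2)      # sum of squared deviations from the known mean
    nu_n = nu0 + len(x)
    sigma_n_sq = (nu0 * sigma0_sq + S) / nu_n
    return nu_n, sigma_n_sq

# Equivalent inverse-gamma parameterization: alpha' = alpha + n/2, beta' = beta + S/2,
# with alpha = nu0/2 and beta = nu0 * sigma0_sq / 2.
rng = np.random.default_rng(6)
x = rng.normal(2.0, 1.5, size=40)              # data with known mean mu = 2.0 (illustrative)
print(posterior_variance_known_mean(x, mu=2.0, nu0=3.0, sigma0_sq=1.0))
</syntaxhighlight>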
==== With unknown mean and unknown variance ====
For a set of [[i.i.d.]] normally distributed data points '''X''' of size ''n'' where each individual point ''x'' follows <math display=inline>x \sim \mathcal{N}(\mu, \sigma^2)</math> with unknown mean μ and unknown [[variance]] σ<sup>2</sup>, a combined (multivariate) [[conjugate prior]] is placed over the mean and variance, consisting of a [[normal-inverse-gamma distribution]]. Logically, this originates as follows:
# From the analysis of the case with unknown mean but known variance, we see that the update equations involve [[sufficient statistic]]s computed from the data consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
# From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data consisting of the number of data points and [[sum of squared deviations]].
# Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
# To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: Shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: As the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
# This suggests that we create a ''conditional prior'' of the mean on the unknown variance, with a hyperparameter specifying the mean of the [[pseudo-observation]]s associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
# This leads immediately to the [[normal-inverse-gamma distribution]], which is the product of the two distributions just defined, with [[conjugate prior]]s used (an [[inverse gamma distribution]] over the variance, and a normal distribution over the mean, ''conditional'' on the variance) and with the same four parameters just defined.
The priors are normally defined as follows:
<math display=block>\begin{align}
p(\mu\mid\sigma^2; \mu_0, n_0) &\sim \mathcal{N}(\mu_0,\sigma^2/n_0) \\
p(\sigma^2; \nu_0,\sigma_0^2) &\sim I\chi^2(\nu_0,\sigma_0^2) = IG(\nu_0/2, \nu_0\sigma_0^2/2)
\end{align}</math>
<!-- \\
& =\frac{(\sigma_0^2\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \propto \frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} -->
The update equations can be derived, and look as follows:
<math display=block>\begin{align}
\bar{x} &= \frac 1 n \sum_{i=1}^n x_i \\
\mu_0' &= \frac{n_0\mu_0 + n\bar{x}}{n_0 + n} \\
n_0' &= n_0 + n \\
\nu_0' &= \nu_0 + n \\
\nu_0'{\sigma_0^2}' &= \nu_0 \sigma_0^2 + \sum_{i=1}^n (x_i-\bar{x})^2 + \frac{n_0 n}{n_0 + n}(\mu_0 - \bar{x})^2
\end{align}</math>
The respective numbers of pseudo-observations are updated simply by adding the number of actual observations to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for <math display=inline>\nu_0'{\sigma_0^2}'</math> is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new interaction term needs to be added to take care of the additional error source stemming from the deviation between prior and data mean.
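These update equations can also be sketched directly in code; the function below (its name and the test values are illustrative, not canonical) returns the updated hyperparameters <math display=inline>(\mu_0', n_0', \nu_0', {\sigma_0^2}')</math>:
<syntaxhighlight lang="python">
import numpy as np

def nig_update(x, mu0, n0, nu0, sigma0_sq):
    """Normal-inverse-gamma conjugate update for unknown mean and variance.

    Hyperparameters: mu0 (prior mean), n0 and nu0 (pseudo-observation counts),
    sigma0_sq (prior estimate of the variance)."""
    x = np.asarray(x)
    n = x.size
    xbar = x.mean()

    mu_n = (n0 * mu0 + n * xbar) / (n0 + n)   # weighted average of prior mean and data mean
    n_n = n0 + n                              # pseudo-observation counts absorb the data
    nu_n = nu0 + n
    nu_sigma_sq_n = (nu0 * sigma0_sq
                     + np.sum((x - xbar) ** 2)                # deviations from the data mean
                     + n0 * n / (n0 + n) * (mu0 - xbar) ** 2) # prior-vs-data mean interaction
    return mu_n, n_n, nu_n, nu_sigma_sq_n / nu_n

rng = np.random.default_rng(7)
x = rng.normal(-1.0, 2.0, size=60)            # arbitrary data
print(nig_update(x, mu0=0.0, n0=1.0, nu0=1.0, sigma0_sq=1.0))
</syntaxhighlight>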
{{math proof | proof =
The prior distributions are
<math display=block>\begin{align}
p(\mu\mid\sigma^2; \mu_0, n_0) &\sim \mathcal{N}(\mu_0,\sigma^2/n_0) = \frac{1}{\sqrt{2\pi\frac{\sigma^2}{n_0}}} \exp\left(-\frac{n_0}{2\sigma^2}(\mu-\mu_0)^2\right) \\
&\propto (\sigma^2)^{-1/2} \exp\left(-\frac{n_0}{2\sigma^2}(\mu-\mu_0)^2\right) \\
p(\sigma^2; \nu_0,\sigma_0^2) &\sim I\chi^2(\nu_0,\sigma_0^2) = IG(\nu_0/2, \nu_0\sigma_0^2/2) \\
&= \frac{(\sigma_0^2\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}~\frac{\exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \\
&\propto {(\sigma^2)^{-(1+\nu_0/2)}} \exp\left[ \frac{-\nu_0 \sigma_0^2}{2 \sigma^2}\right].
\end{align}</math>
Therefore, the joint prior is
<math display=block>\begin{align}
p(\mu,\sigma^2; \mu_0, n_0, \nu_0,\sigma_0^2) &= p(\mu\mid\sigma^2; \mu_0, n_0)\,p(\sigma^2; \nu_0,\sigma_0^2) \\
&\propto (\sigma^2)^{-(\nu_0+3)/2} \exp\left[-\frac 1 {2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu-\mu_0)^2\right)\right].
\end{align}</math>
The [[likelihood function]] from the section above with known variance is:
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\sigma^2) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n(x_i -\mu)^2\right)\right]
\end{align}</math>
Writing it in terms of variance rather than precision, we get:
<math display=block>\begin{align}
p(\mathbf{X}\mid\mu,\sigma^2) &= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n(x_i-\bar{x})^2 + n(\bar{x} -\mu)^2\right)\right] \\
&\propto {\sigma^2}^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \left(S + n(\bar{x} -\mu)^2\right)\right]
\end{align}</math>
where <math display=inline>S = \sum_{i=1}^n(x_i-\bar{x})^2.</math>

Therefore, the posterior is (dropping the hyperparameters as conditioning factors):
<math display=block>\begin{align}
p(\mu,\sigma^2\mid\mathbf{X}) & \propto p(\mu,\sigma^2) \, p(\mathbf{X}\mid\mu,\sigma^2) \\
& \propto (\sigma^2)^{-(\nu_0+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu-\mu_0)^2\right)\right] {\sigma^2}^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \left(S + n(\bar{x} -\mu)^2\right)\right] \\
&= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + n_0(\mu-\mu_0)^2 + n(\bar{x} -\mu)^2\right)\right] \\
&= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2 + (n_0+n)\left(\mu-\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}\right)^2\right)\right] \\
& \propto (\sigma^2)^{-1/2} \exp\left[-\frac{n_0+n}{2\sigma^2}\left(\mu-\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}\right)^2\right] \\
& \quad\times (\sigma^2)^{-(\nu_0/2+n/2+1)} \exp\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right] \\
& = \mathcal{N}_{\mu\mid\sigma^2}\left(\frac{n_0\mu_0 + n\bar{x}}{n_0 + n}, \frac{\sigma^2}{n_0+n}\right) \cdot {\rm IG}_{\sigma^2}\left(\frac12(\nu_0+n), \frac12\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right).
\end{align}</math>
In other words, the posterior distribution has the form of a product of a normal distribution over <math display=inline>p(\mu|\sigma^2)</math> times an inverse gamma distribution over <math display=inline>p(\sigma^2)</math>, with parameters that are the same as the update equations above.
}}