==Definition and basic properties==

The MSE assesses the quality of either a ''[[predictor (statistics)|predictor]]'' (i.e., a function mapping arbitrary inputs to a sample of values of some [[random variable]]) or an ''[[estimator]]'' (i.e., a [[mathematical function]] mapping a [[Sample (statistics)|sample]] of data to an estimate of a [[Statistical parameter|parameter]] of the [[Statistical population|population]] from which the data are sampled). In the context of prediction, the [[prediction interval]] is also useful, as it gives a range within which a future observation will fall with a specified probability. The definition of the MSE differs according to whether one is describing a predictor or an estimator.

===Predictor===

If a vector of <math>n</math> predictions is generated from a sample of <math>n</math> data points on all variables, and <math>Y</math> is the vector of observed values of the variable being predicted, with <math>\hat{Y}</math> being the predicted values (e.g. as from a [[least-squares fit]]), then the within-sample MSE of the predictor is computed as

:<math>\operatorname{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2</math>

In other words, the MSE is the ''mean'' <math display="inline">\left(\frac{1}{n} \sum_{i=1}^n \right)</math> of the ''squares of the errors'' <math display="inline">\left(Y_i-\hat{Y_i}\right)^2</math>. This is an easily computable quantity for a particular sample (and hence is sample-dependent).

In [[Matrix_multiplication|matrix]] notation, 
:<math>\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(e_i)^2=\frac{1}{n}\mathbf e^\mathsf T \mathbf e</math>
where <math>e_i = Y_i-\hat{Y_i}</math> and <math>\mathbf e</math> is the <math>n \times 1</math> column vector of these errors.
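
The two expressions above can be checked numerically. The following is a minimal sketch in Python with NumPy, not a reference implementation; the function names <code>mse</code> and <code>mse_matrix</code> and the four data points are hypothetical and chosen only for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Within-sample MSE: the mean of the squared errors (Y_i - Yhat_i)^2."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(e ** 2))

def mse_matrix(y, y_hat):
    """Same quantity in matrix form: (1/n) * e^T e, with e an n x 1 column vector."""
    e = (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)).reshape(-1, 1)
    n = e.shape[0]
    return float((e.T @ e).item() / n)

# Hypothetical observed and predicted values.
y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

# Errors are 0.5, -0.5, 0, -1; their squares sum to 1.5, so the MSE is 1.5 / 4.
print(mse(y, y_hat))         # 0.375
print(mse_matrix(y, y_hat))  # 0.375
```

Both functions return the same value because <math>\mathbf e^\mathsf T \mathbf e</math> is simply the sum of the squared errors.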

The MSE can also be computed on <math>q</math> data points that were not used in estimating the model, either because they were held back for this purpose or because they have been newly obtained. Within this process, known as [[Cross-validation (statistics)|cross-validation]], the MSE is often called the [[test MSE]],<ref>{{cite book
|first1=Gareth
|last1=James
|first2=Daniela
|last2=Witten
|first3=Trevor
|last3=Hastie
|first4=Rob
|last4=Tibshirani
|date=2021
|title=An Introduction to Statistical Learning: with Applications in R
|url=https://www.statlearning.com/
|publisher=Springer
|isbn=978-1071614174
}}</ref> and is computed as

:<math>\operatorname{MSE} = \frac{1}{q} \sum_{i=n+1}^{n+q} \left(Y_i-\hat{Y_i}\right)^2</math>
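
As an illustration of the distinction between within-sample and test MSE, the sketch below fits a least-squares line on <math>n</math> points and evaluates it on <math>q</math> held-out points. It assumes simulated data from a hypothetical linear model with Gaussian noise; the sample sizes, coefficients, and the helper name <code>predict</code> are illustrative choices, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 1 + 2x + noise, with n training and q held-out points.
n, q = 50, 20
x = rng.uniform(0.0, 10.0, size=n + q)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n + q)

x_train, y_train = x[:n], y[:n]   # points 1..n, used to fit the model
x_test, y_test = x[n:], y[n:]     # points n+1..n+q, held back

# Least-squares line; np.polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x_new):
    """Predictions from the fitted line."""
    return intercept + slope * x_new

train_mse = float(np.mean((y_train - predict(x_train)) ** 2))
test_mse = float(np.mean((y_test - predict(x_test)) ** 2))
print(train_mse, test_mse)
```

The test MSE uses only data the fit never saw, so it is usually the more honest indicator of predictive performance, although in any single run it may come out either above or below the within-sample value.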

===Estimator===

The MSE of an estimator <math>\hat{\theta}</math> with respect to an unknown parameter <math>\theta</math> is defined as<ref name=":1" />

:<math>\operatorname{MSE}(\hat{\theta})=\operatorname{E}_{\theta}\left[(\hat{\theta}-\theta)^2\right].</math>

Because this definition depends on the unknown parameter, the MSE is an ''a priori'' property of an estimator. The MSE may itself be a function of unknown parameters, in which case any ''estimator'' of the MSE based on estimates of these parameters is a function of the data (and thus a random variable). If the estimator <math>\hat{\theta}</math> is derived as a sample statistic and is used to estimate some population parameter, then the expectation is taken with respect to the [[sampling distribution]] of the sample statistic.

The MSE can be written as the sum of the [[variance]] of the estimator and the squared [[Bias_of_an_estimator|bias]] of the estimator. This provides a useful way to calculate the MSE and implies that, for unbiased estimators, the MSE and the variance are equal.<ref name="wackerly">{{cite book |first1=Dennis |last1=Wackerly |first2=William|last2=Mendenhall |first3=Richard L.|last3=Scheaffer |title=Mathematical Statistics with Applications |publisher=Thomson Higher Education|location=Belmont, CA, USA |year=2008 |edition=7 |isbn=978-0-495-38508-0}}</ref>

:<math>\operatorname{MSE}(\hat{\theta})=\operatorname{Var}_{\theta}(\hat{\theta})+ \operatorname{Bias}(\hat{\theta},\theta)^2.</math>
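
The decomposition can be illustrated by simulation. The sketch below is only an example under stated assumptions: the parameter is the mean <math>\theta = 2</math> of a normal distribution with unit variance, and the estimator is a deliberately biased "shrunken" sample mean, <math>0.9\,\bar{X}</math>. Both choices, and all variable names, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 2.0                # true parameter (assumed known only for the simulation)
n, trials = 25, 100_000    # sample size and number of simulated samples

# Each row is one sample of size n; the estimator is 0.9 times the sample mean.
samples = rng.normal(theta, 1.0, size=(trials, n))
estimates = 0.9 * samples.mean(axis=1)

mse = float(np.mean((estimates - theta) ** 2))   # approximates E[(theta_hat - theta)^2]
variance = float(np.var(estimates))              # approximates Var(theta_hat), ddof=0
bias = float(np.mean(estimates) - theta)         # approximates E[theta_hat] - theta

print(mse, variance + bias ** 2)
```

Under these assumptions the exact values are <math>\operatorname{Var}(\hat\theta) = 0.81/25 = 0.0324</math> and <math>\operatorname{Bias}^2 = (-0.2)^2 = 0.04</math>, so the MSE is 0.0724; the simulated figures should be close to this, and the two printed numbers agree up to floating-point rounding because the identity also holds for the empirical distribution of the simulated estimates.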

====Proof of variance and bias relationship====

<math>\begin{align}
\operatorname{MSE}(\hat{\theta})
&= \operatorname{E}_\theta \left [(\hat{\theta}-\theta)^2 \right ] \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta [\hat\theta]+\operatorname{E}_\theta[\hat\theta]-\theta\right)^2\right]\\ 
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2 +2\left (\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right ) \left (\operatorname{E}_\theta[\hat\theta]-\theta \right )+\left( \operatorname{E}_\theta[\hat\theta]-\theta \right)^2\right] \\ 
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+\operatorname{E}_\theta\left[2 \left (\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right ) \left (\operatorname{E}_\theta[\hat\theta]-\theta \right ) \right] + \operatorname{E}_\theta\left [ \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 \right] \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+ 2 \left(\operatorname{E}_\theta[\hat\theta]-\theta\right) \operatorname{E}_\theta\left[\hat{\theta}-\operatorname{E}_\theta[\hat\theta] \right] +  \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 && \operatorname{E}_\theta[\hat\theta]-\theta = \text{constant} \\
&= \operatorname{E}_\theta\left[\left(\hat{\theta}-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+ 2 \left(\operatorname{E}_\theta [\hat\theta]-\theta\right) \left ( \operatorname{E}_\theta[\hat{\theta}]-\operatorname{E}_\theta[\hat\theta] \right )+  \left(\operatorname{E}_\theta[\hat\theta]-\theta\right)^2 && \operatorname{E}_\theta[\hat\theta] = \text{constant} \\
&= \operatorname{E}_\theta\left[\left(\hat\theta-\operatorname{E}_\theta[\hat\theta]\right)^2\right]+\left(\operatorname{E}_\theta [\hat\theta]-\theta\right)^2\\ 
&= \operatorname{Var}_\theta(\hat\theta)+ \operatorname{Bias}(\hat\theta,\theta)^2
\end{align}</math>

A shorter proof uses the well-known identity that, for a random variable <math display="inline">X</math>, <math display="inline">\mathbb{E}(X^2) = \operatorname{Var}(X) + (\mathbb{E}(X))^2</math>. Substituting <math display="inline">\hat\theta-\theta</math> for <math display="inline">X</math>, and noting that subtracting the constant <math display="inline">\theta</math> does not change the variance, we have
:<math display="block">\begin{aligned}
\operatorname{MSE}(\hat{\theta}) &= \mathbb{E}[(\hat\theta-\theta)^2] \\
&= \operatorname{Var}(\hat{\theta} - \theta) + (\mathbb{E}[\hat\theta - \theta])^2 \\
&= \operatorname{Var}(\hat\theta) + \operatorname{Bias}^2(\hat\theta,\theta)
\end{aligned}</math>
In practical modeling, however, the MSE of a prediction can be decomposed into the model variance, the squared model bias, and an irreducible error (see [[Bias–variance tradeoff]]). Because the MSE incorporates both the variance and the bias of an estimator, it can be used to compare the [[Efficiency (statistics)|efficiency]] of estimators; this is called the MSE criterion.
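
As a hedged illustration of comparing estimators by their MSE, the sketch below simulates two estimators of the variance of a normal population: the unbiased sample variance (divisor <math>n-1</math>) and the biased version (divisor <math>n</math>). The choice of estimators, the sample size, and the helper name <code>empirical_mse</code> are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2 = 1.0               # true variance (known only because the data are simulated)
n, trials = 10, 200_000    # sample size and number of simulated samples
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divides by n

def empirical_mse(estimates, target):
    """Average squared deviation of simulated estimates from the true value."""
    return float(np.mean((estimates - target) ** 2))

mse_unbiased = empirical_mse(s2_unbiased, sigma2)
mse_biased = empirical_mse(s2_biased, sigma2)
print(mse_unbiased, mse_biased)
```

For normally distributed data the exact values are <math>2\sigma^4/(n-1) \approx 0.222</math> for the unbiased estimator and <math>(2n-1)\sigma^4/n^2 = 0.19</math> for the biased one when <math>n = 10</math> and <math>\sigma^2 = 1</math>. The biased estimator therefore has the smaller MSE here, which shows that unbiasedness alone does not guarantee the lowest MSE.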