=== Example: linear regression ===
In linear regression, there exist [[real number|real]] ''response values'' <math display="inline">y_1,\ldots, y_n</math>, and ''n'' ''p''-dimensional [[Euclidean vector|vector]] ''covariates'' '''''x'''''<sub>1</sub>, ..., '''''x'''''<sub>''n''</sub>. The components of the vector '''''x'''''<sub>''i''</sub> are denoted ''x''<sub>''i''1</sub>, ..., ''x''<sub>''ip''</sub>. If [[least squares]] is used to fit a function in the form of a [[hyperplane]] '''''ŷ''''' = ''a'' + '''''β'''''<sup>T</sup>'''''x''''' to the data ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub>, then the fit can be assessed using the [[mean squared error]] (MSE). The MSE for given estimated parameter values ''a'' and '''''β''''' on the training set ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub> is defined as:

:<math>\begin{align} \text{MSE} &= \frac 1 n \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac 1 n \sum_{i=1}^n (y_i - a - \boldsymbol\beta^T \mathbf{x}_i)^2 \\ &= \frac{1}{n}\sum_{i=1}^n (y_i - a - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 \end{align}</math>

If the model is correctly specified, it can be shown under mild assumptions that the [[expected value]] of the MSE for the training set is (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus, a fitted model and computed MSE on the training set will result in an optimistically [[bias (statistics)|biased]] assessment of how well the model will fit an independent data set. This biased estimate is called the ''in-sample'' estimate of the fit, whereas the cross-validation estimate is an ''out-of-sample'' estimate.{{fact|date=November 2024}}

Since in linear regression it is possible to directly compute the factor (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) by which the training MSE underestimates the validation MSE under the assumption that the model specification is valid, cross-validation can be used for checking whether the model has been [[overfitting|overfitted]], in which case the MSE in the validation set will substantially exceed its anticipated value. (Cross-validation in the context of linear regression is also useful in that it can be used to select an optimally [[Regularization (mathematics)|regularized]] [[Loss function|cost function]].)
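The optimism of the in-sample estimate can be checked by simulation. The following is a minimal sketch in Python (not part of the article's source) that fits a least-squares hyperplane and compares the average training MSE with the average MSE on an independent validation set; the Gaussian data-generating model, the sizes ''n'' = 50 and ''p'' = 5, and the random seed are illustrative assumptions, and only NumPy is assumed to be available.

<syntaxhighlight lang="python">
# Sketch: estimate E[training MSE] / E[validation MSE] by simulation and
# compare it with the theoretical factor (n - p - 1)/(n + p + 1).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                      # training-set size and number of covariates
beta_true = rng.standard_normal(p)
a_true = 1.0

def simulate(rng):
    """Draw covariates x_i and responses y_i = a + beta^T x_i + Gaussian noise."""
    X = rng.standard_normal((n, p))
    y = a_true + X @ beta_true + rng.standard_normal(n)
    return X, y

def fit_and_mses(rng):
    X_tr, y_tr = simulate(rng)    # training set
    X_va, y_va = simulate(rng)    # independent validation set
    # Least-squares fit of the hyperplane y = a + beta^T x
    # (column of ones carries the intercept a).
    A_tr = np.column_stack([np.ones(n), X_tr])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_va = np.column_stack([np.ones(n), X_va])
    mse_tr = np.mean((y_tr - A_tr @ coef) ** 2)   # in-sample estimate
    mse_va = np.mean((y_va - A_va @ coef) ** 2)   # out-of-sample estimate
    return mse_tr, mse_va

# Average over many training sets to approximate the expected MSEs.
mses = np.array([fit_and_mses(rng) for _ in range(2000)])
print("simulated ratio  E[train MSE] / E[val MSE] ~", mses[:, 0].mean() / mses[:, 1].mean())
print("theoretical factor (n-p-1)/(n+p+1)        =", (n - p - 1) / (n + p + 1))
</syntaxhighlight>

With these assumed settings the simulated ratio should land near 44/56 ≈ 0.79, illustrating how far the in-sample estimate understates the out-of-sample error when the model is correctly specified.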