=== Example: linear regression ===
In linear regression, there exist [[real number|real]] ''response values'' <math display="inline">y_1,\ldots, y_n</math>, and ''n'' ''p''-dimensional [[Euclidean vector|vector]] ''covariates'' '''''x'''''<sub>1</sub>, ..., '''''x'''''<sub>''n''</sub>. The components of the vector '''''x'''''<sub>''i''</sub> are denoted ''x''<sub>''i''1</sub>, ..., ''x''<sub>''ip''</sub>. If [[least squares]] is used to fit a function in the form of a [[hyperplane]] '''''ŷ''''' = ''a'' + '''''β'''''<sup>T</sup>'''''x''''' to the data ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub>, then the fit can be assessed using the [[mean squared error]] (MSE). The MSE for given estimated parameter values ''a'' and '''''β''''' on the training set ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub> is defined as:

:<math>\begin{align} \text{MSE} &= \frac 1 n \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac 1 n \sum_{i=1}^n (y_i - a - \boldsymbol\beta^T \mathbf{x}_i)^2 \\ &= \frac{1}{n}\sum_{i=1}^n (y_i - a - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 \end{align}</math>

If the model is correctly specified, it can be shown under mild assumptions that the [[expected value]] of the MSE for the training set is (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus, a fitted model and computed MSE on the training set will result in an optimistically [[bias (statistics)|biased]] assessment of how well the model will fit an independent data set. This biased estimate is called the ''in-sample'' estimate of the fit, whereas the cross-validation estimate is an ''out-of-sample'' estimate.{{fact|date=November 2024}}

Since in linear regression it is possible to directly compute the factor (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) by which the training MSE underestimates the validation MSE under the assumption that the model specification is valid, cross-validation can be used for checking whether the model has been [[overfitting|overfitted]], in which case the MSE in the validation set will substantially exceed its anticipated value. (Cross-validation in the context of linear regression is also useful in that it can be used to select an optimally [[Regularization (mathematics)|regularized]] [[Loss function|cost function]].)
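The optimism of the in-sample estimate can be checked by simulation. The following is a minimal sketch in Python (not part of the article's source) that fits a least-squares hyperplane and compares the average training MSE with the average MSE on an independent validation set; the Gaussian data-generating model, the sizes ''n'' = 50 and ''p'' = 5, and the random seed are illustrative assumptions, and only NumPy is assumed to be available.

<syntaxhighlight lang="python">
# Sketch: estimate E[training MSE] / E[validation MSE] by simulation and
# compare it with the theoretical factor (n - p - 1)/(n + p + 1).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                      # training-set size and number of covariates
beta_true = rng.standard_normal(p)
a_true = 1.0

def simulate(rng):
    """Draw covariates x_i and responses y_i = a + beta^T x_i + Gaussian noise."""
    X = rng.standard_normal((n, p))
    y = a_true + X @ beta_true + rng.standard_normal(n)
    return X, y

def fit_and_mses(rng):
    X_tr, y_tr = simulate(rng)    # training set
    X_va, y_va = simulate(rng)    # independent validation set
    # Least-squares fit of the hyperplane y = a + beta^T x
    # (column of ones carries the intercept a).
    A_tr = np.column_stack([np.ones(n), X_tr])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_va = np.column_stack([np.ones(n), X_va])
    mse_tr = np.mean((y_tr - A_tr @ coef) ** 2)   # in-sample estimate
    mse_va = np.mean((y_va - A_va @ coef) ** 2)   # out-of-sample estimate
    return mse_tr, mse_va

# Average over many training sets to approximate the expected MSEs.
mses = np.array([fit_and_mses(rng) for _ in range(2000)])
print("simulated ratio  E[train MSE] / E[val MSE] ~", mses[:, 0].mean() / mses[:, 1].mean())
print("theoretical factor (n-p-1)/(n+p+1)        =", (n - p - 1) / (n + p + 1))
</syntaxhighlight>

With these assumed settings the simulated ratio should land near 44/56 ≈ 0.79, illustrating how far the in-sample estimate understates the out-of-sample error when the model is correctly specified.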