Cross-validation (statistics)
==Motivation==
Assume a [[statistical model|model]] with one or more unknown [[parameters]], and a data set to which the model can be fit (the training data set). The fitting process [[optimization (mathematics)|optimizes]] the model parameters to make the model fit the training data as well as possible. If an [[independence (probability theory)|independent]] sample of validation data is taken from the same [[statistical population|population]] as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This difference is likely to be large especially when the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to estimate the size of this effect.{{fact|date=November 2024}}

===Example: linear regression===
In linear regression, there exist [[real number|real]] ''response values'' <math display="inline">y_1,\ldots, y_n </math>, and ''n'' ''p''-dimensional [[Euclidean vector|vector]] ''covariates'' '''''x'''''<sub>1</sub>, ..., '''''x'''''<sub>''n''</sub>. The components of the vector '''''x'''''<sub>''i''</sub> are denoted ''x''<sub>''i''1</sub>, ..., ''x''<sub>''ip''</sub>. If [[least squares]] is used to fit a function in the form of a [[hyperplane]] '''''ŷ''''' = ''a'' + '''''β'''''<sup>T</sup>'''''x''''' to the data ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub>, then the fit can be assessed using the [[mean squared error]] (MSE).
The MSE for given estimated parameter values ''a'' and '''''β''''' on the training set ('''''x'''''<sub>''i''</sub>, ''y''<sub>''i''</sub>)<sub> 1 ≤ ''i'' ≤ ''n''</sub> is defined as:
:<math>\begin{align} \text{MSE} &= \frac 1 n \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac 1 n \sum_{i=1}^n (y_i - a - \boldsymbol\beta^T \mathbf{x}_i)^2\\&= \frac{1}{n}\sum_{i=1}^n (y_i - a - \beta_1x_{i1} - \dots - \beta_px_{ip})^2 \end{align}</math>

If the model is correctly specified, it can be shown under mild assumptions that the [[expected value]] of the MSE for the training set is (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus, a fitted model and computed MSE on the training set will result in an optimistically [[bias (statistics)|biased]] assessment of how well the model will fit an independent data set. This biased estimate is called the ''in-sample'' estimate of the fit, whereas the cross-validation estimate is an ''out-of-sample'' estimate.{{fact|date=November 2024}}

Since in linear regression it is possible to directly compute the factor (''n'' − ''p'' − 1)/(''n'' + ''p'' + 1) by which the training MSE underestimates the validation MSE under the assumption that the model specification is valid, cross-validation can be used for checking whether the model has been [[overfitting|overfitted]], in which case the MSE in the validation set will substantially exceed its anticipated value. (Cross-validation in the context of linear regression is also useful in that it can be used to select an optimally [[Regularization (mathematics)|regularized]] [[Loss function|cost function]].)

===General case===
In most other regression procedures (e.g. [[logistic regression]]), there is no simple formula to compute the expected out-of-sample fit.
Cross-validation is, thus, a generally applicable way to predict the performance of a model on unavailable data using numerical computation in place of theoretical analysis.
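
The gap between in-sample and out-of-sample fit described above can be illustrated numerically. The following is a minimal sketch, not a reference implementation: it assumes synthetic Gaussian data, uses NumPy, and picks ''k''-fold splitting as one possible form of cross-validation. The helper names <code>fit_least_squares</code>, <code>mse</code> and <code>k_fold_mse</code> are hypothetical and chosen only for this illustration.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return (a, beta) minimizing the training MSE of y ~ a + X @ beta."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]


def mse(X, y, a, beta):
    """Mean squared error of the hyperplane a + X @ beta on (X, y)."""
    return float(np.mean((y - a - X @ beta) ** 2))


def k_fold_mse(X, y, k=5, seed=0):
    """Average validation MSE over k folds (assumed splitting scheme)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i, val_idx in enumerate(folds):
        # Fit on every fold except fold i, then score on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        a, beta = fit_least_squares(X[train_idx], y[train_idx])
        errors.append(mse(X[val_idx], y[val_idx], a, beta))
    return float(np.mean(errors))


# Synthetic data (an assumption of this sketch): small n relative to p,
# the regime in which the text says the gap tends to be large.
rng = np.random.default_rng(42)
n, p = 40, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = 1.0 + X @ beta_true + rng.normal(size=n)

a_hat, beta_hat = fit_least_squares(X, y)
train_mse = mse(X, y, a_hat, beta_hat)  # in-sample estimate
cv_mse = k_fold_mse(X, y, k=5)          # out-of-sample estimate

print(f"training MSE:         {train_mse:.3f}")
print(f"cross-validation MSE: {cv_mse:.3f}")
```

On data like this, the training MSE is typically noticeably smaller than the cross-validated MSE, which is the optimistic bias of the in-sample estimate. The same <code>k_fold_mse</code> loop would apply unchanged to a fitting procedure with no closed-form correction factor, since it relies only on refitting and scoring.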