Cross-validation (statistics)
====''k''-fold cross-validation====
[[File:KfoldCV.gif|right|thumb|350x350px|Illustration of ''k''-fold cross-validation when ''n'' = 12 observations and ''k'' = 3. After the data is shuffled, a total of 3 models will be trained and tested.]]
In ''k''-fold cross-validation, the original sample is randomly partitioned into ''k'' equal-sized subsamples, often referred to as "folds". Of the ''k'' subsamples, a single subsample is retained as the validation data for testing the model, and the remaining ''k'' − 1 subsamples are used as training data. The cross-validation process is then repeated ''k'' times, with each of the ''k'' subsamples used exactly once as the validation data. The ''k'' results can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once.

10-fold cross-validation is commonly used,<ref name="McLachlan">{{cite book |doi=10.1002/047172842X |title=Analyzing Microarray Gene Expression Data |series=Wiley Series in Probability and Statistics |date=2004 |publisher=Wiley |isbn=978-0-471-22616-1 }}{{pn|date=November 2024}}</ref> but in general ''k'' remains an unfixed parameter.

For example, setting ''k'' = 2 results in 2-fold cross-validation. In 2-fold cross-validation, the dataset is randomly split into two sets ''d''<sub>0</sub> and ''d''<sub>1</sub> of equal size (this is usually implemented by shuffling the data array and then splitting it in two). The model is then trained on ''d''<sub>0</sub> and validated on ''d''<sub>1</sub>, followed by training on ''d''<sub>1</sub> and validating on ''d''<sub>0</sub>.
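The ''k''-fold procedure can be illustrated with a minimal sketch in Python. The function names <code>k_fold_indices</code> and <code>cross_validate</code>, the <code>seed</code> parameter, and the mean-predictor model are assumptions made for this illustration; they are not taken from any particular library. When ''k'' does not divide ''n'', this sketch lets fold sizes differ by at most one.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, then partition
    # Fold sizes differ by at most one when k does not divide n.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Average a score over the k validation folds."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model,
                            [xs[i] for i in validation],
                            [ys[i] for i in validation]))
    return sum(scores) / len(scores)

# Hypothetical model for illustration: always predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_squared_error(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

if __name__ == "__main__":
    xs = list(range(12))                  # n = 12 observations
    ys = [2.0 * x + 1.0 for x in xs]
    for train, validation in k_fold_indices(len(xs), k=3):
        print("train:", sorted(train), "validation:", sorted(validation))
    print("mean MSE over 3 folds:",
          cross_validate(xs, ys, 3, fit_mean, mean_squared_error))
```

With ''n'' = 12 and ''k'' = 3, each observation appears in exactly one validation fold of four observations and in the training set of the other two models. Calling <code>k_fold_indices(n, n)</code> produces validation folds of a single observation each.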
When ''k'' = ''n'' (the number of observations), ''k''-fold cross-validation is equivalent to leave-one-out cross-validation.<ref>{{cite book |doi=10.1007/978-0-387-84858-7 |title=The Elements of Statistical Learning |series=Springer Series in Statistics |date=2009 |isbn=978-0-387-84857-0 }}{{pn|date=November 2024}}</ref>

In ''stratified'' ''k''-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels.

In ''repeated'' cross-validation the data is randomly split into ''k'' partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.<ref>{{cite book |last1=Vanwinckelen |first1=Gitte |last2=Blockeel |first2=Hendrik |chapter=On Estimating Model Accuracy with Repeated Cross-Validation |pages=39–44 |chapter-url=https://lirias.kuleuven.be/retrieve/186558 |date=2012 |title=BeneLearn 2012: Proceedings of the 21st Belgian-Dutch Conference on Machine Learning |isbn=978-94-6197-044-2 }}</ref>

When many different statistical or [[Machine learning#Models|machine learning models]] are being considered, ''greedy'' ''k''-fold cross-validation can be used to quickly identify the most promising candidate models.<ref name="soper">{{cite journal |last1=Soper |first1=Daniel S. |title=Greed Is Good: Rapid Hyperparameter Optimization and Model Selection Using Greedy k-Fold Cross Validation |journal=Electronics |date=16 August 2021 |volume=10 |issue=16 |pages=1973 |doi=10.3390/electronics10161973 |doi-access=free }}</ref>
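For classification, the stratified variant described above can be sketched as follows. This is one possible approach, assumed here for illustration: the indices of each class are shuffled and dealt to the folds in round-robin order, so that each fold receives roughly the same class proportions. The function name <code>stratified_k_fold_indices</code> and its <code>seed</code> parameter are hypothetical and not taken from any particular library.

```python
import random
from collections import defaultdict

def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train, validation) index lists with similar class proportions."""
    if not 2 <= k <= len(labels):
        raise ValueError("k must satisfy 2 <= k <= number of observations")
    rng = random.Random(seed)

    # Group observation indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Deal each class to the folds round-robin. The counter carries over
    # between classes so that overall fold sizes also stay balanced.
    folds = [[] for _ in range(k)]
    next_fold = 0
    for members in by_class.values():
        rng.shuffle(members)
        for i in members:
            folds[next_fold % k].append(i)
            next_fold += 1

    for f in range(k):
        validation = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, validation

if __name__ == "__main__":
    labels = ["neg"] * 8 + ["pos"] * 4    # 2:1 class ratio
    for train, validation in stratified_k_fold_indices(labels, k=4):
        print(sorted(labels[i] for i in validation))
```

In this example each of the four validation folds contains two <code>"neg"</code> and one <code>"pos"</code> observation, preserving the 2:1 ratio of the full sample. Repeated cross-validation can be sketched in the same way by calling the function several times with different <code>seed</code> values and averaging the resulting scores.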