==Limitations and misuse==
Cross-validation only yields meaningful results if the validation set and the training set are drawn from the same population, and only if human biases are controlled. In many applications of predictive modeling, the structure of the system being studied evolves over time (i.e. it is "non-stationary"). Both of these can introduce systematic differences between the training and validation sets. For example, if a model for predicting trend changes in financial quotations is trained on data from a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual's risk of being [[medical diagnosis|diagnosed]] with a particular disease within the next year. If the model is trained using data from a study involving only a specific population group (e.g. young people or males) but is then applied to the general population, the cross-validation results from the training set could differ greatly from the actual predictive performance.

In many applications, models may also be incorrectly specified and vary as a function of modeler biases and/or arbitrary choices. When this occurs, there may be an illusion that the system changes in external samples, whereas the real reason is that the model has missed a critical predictor and/or included a confounded predictor. New evidence suggests that cross-validation by itself is not very predictive of external validity, whereas a form of experimental validation known as swap sampling, which does control for human bias, can be much more predictive of external validity.<ref>{{cite journal |doi=10.1038/nbt.1665 |title=The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models |journal=Nature Biotechnology |date=2010 |volume=28 |issue=8 |pages=827–838 |pmid=20676074 |last1=Shi |first1=L. |last2=Campbell |first2=G. |last3=Jones |first3=W. D. |last4=Campagne |first4=F. |last5=Wen |first5=Z. |last6=Walker |first6=S. J. |last7=Su |first7=Z. |last8=Chu |first8=T. M. |last9=Goodsaid |first9=F. M. |last10=Pusztai |first10=L. |last11=Shaughnessy Jr |first11=J. D. |last12=Oberthuer |first12=A. |last13=Thomas |first13=R. S. |last14=Paules |first14=R. S. |last15=Fielden |first15=M. |last16=Barlogie |first16=B. |last17=Chen |first17=W. |last18=Du |first18=P. |last19=Fischer |first19=M. |last20=Furlanello |first20=C. |last21=Gallas |first21=B. D. |last22=Ge |first22=X. |last23=Megherbi |first23=D. B. |last24=Symmans |first24=W. F. |last25=Wang |first25=M. D. |last26=Zhang |first26=J. |last27=Bitter |first27=H. |last28=Brors |first28=B. |last29=Bushel |first29=P. R. |last30=Bylesjo |first30=M. |pmc=3315840 |display-authors=1 }}</ref> As defined by this large MAQC-II study across 30,000 models, swap sampling incorporates cross-validation in the sense that predictions are tested on independent training and validation samples, but models are also developed independently on each of these samples by modelers who are blinded to one another. When the models developed on the swapped training and validation samples disagree, as happens quite frequently, MAQC-II shows that this is much more predictive of poor external validity than traditional cross-validation. The reason for the success of swap sampling is its built-in control for human biases in model building.
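The effect of such population drift on cross-validation estimates can be illustrated with a small simulation. The following is a minimal sketch using the scikit-learn library and synthetic data with an arbitrary time trend (the data set, trend shape and variable names are illustrative assumptions, not part of any cited study): it compares a shuffled ''k''-fold estimate with an evaluation that trains only on earlier observations and tests on later ones. When the data-generating process drifts over time, the shuffled estimate is typically the more optimistic of the two.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
trend = np.linspace(0.0, 5.0, n)                      # slow drift of the target over "time"
y = X.sum(axis=1) + trend + rng.normal(scale=0.5, size=n)

model = Ridge()
# Shuffled k-fold mixes past and future observations in every training set.
shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Forward-chaining evaluation always trains on earlier data and tests on later data.
forward = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
print("shuffled k-fold R^2:        %.2f" % shuffled.mean())  # optimistic under drift
print("train-past/test-future R^2: %.2f" % forward.mean())   # usually much lower
</syntaxhighlight>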
In addition to placing too much faith in predictions that may vary across modelers and show poor external validity due to these confounding modeler effects, cross-validation can be misused in other ways:
* Performing an initial analysis to identify the most informative [[features (pattern recognition)|features]] using the entire data set – if feature selection or model tuning is required by the modeling procedure, it must be repeated on every training set; otherwise, predictions will certainly be upwardly biased.<ref name="Bermingham-intro">{{cite journal |last1=Bermingham |first1=M. L. |last2=Pong-Wong |first2=R. |last3=Spiliopoulou |first3=A. |last4=Hayward |first4=C. |last5=Rudan |first5=I. |last6=Campbell |first6=H. |last7=Wright |first7=A. F. |last8=Wilson |first8=J. F. |last9=Agakov |first9=F. |last10=Navarro |first10=P. |last11=Haley |first11=C. S. |title=Application of high-dimensional feature selection: evaluation for genomic prediction in man |journal=Scientific Reports |date=19 May 2015 |volume=5 |issue=1 |page=10312 |doi=10.1038/srep10312 |pmid=25988841 |pmc=4437376 |bibcode=2015NatSR...510312B }}</ref> If cross-validation is used to decide which features to use, an ''inner cross-validation'' that carries out the feature selection on every training set must be performed<ref>{{cite journal |last1=Varma |first1=Sudhir |last2=Simon |first2=Richard |title=Bias in error estimation when using cross-validation for model selection |journal=BMC Bioinformatics |date=December 2006 |volume=7 |issue=1 |page=91 |doi=10.1186/1471-2105-7-91 |pmid=16504092 |pmc=1397873 |doi-access=free }}</ref> (a sketch of such a nested procedure follows this list).
* Performing mean-centering, rescaling, dimensionality reduction, outlier removal or any other data-dependent preprocessing using the entire data set. While very common in practice, this has been shown to introduce biases into the cross-validation estimates.<ref>{{cite journal |last1=Moscovich |first1=Amit |last2=Rosset |first2=Saharon |title=On the Cross-Validation Bias due to Unsupervised Preprocessing |journal=Journal of the Royal Statistical Society Series B: Statistical Methodology |date=September 2022 |volume=84 |issue=4 |pages=1474–1502 |doi=10.1111/rssb.12537 |arxiv=1901.08974 }}</ref>
* Allowing some of the training data to also be included in the test set – this can happen due to "twinning" in the data set, whereby some exactly identical or nearly identical samples are present; see [[pseudoreplication]]. To some extent, twinning always takes place even in perfectly independent training and validation samples, because some training observations will have nearly identical predictor values to those of validation observations. Some of these will correlate with the target at better-than-chance levels, in the same direction in both training and validation, even though they are actually driven by confounded predictors with poor external validity. If such a cross-validated model is selected from a ''k''-fold set, human [[confirmation bias]] will be at work and conclude that the model has been validated. This is why traditional cross-validation needs to be supplemented with controls for human bias and confounded model specification, such as swap sampling and prospective studies.
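A common way to avoid the first two forms of leakage described above is to wrap every data-dependent step in a single pipeline, so that scaling, feature selection and hyperparameter tuning are re-fitted on each training fold only, with an inner cross-validation choosing the tuning parameters and an outer cross-validation estimating performance. The following is a minimal sketch of such a nested procedure using the scikit-learn library; the synthetic data set and the parameter grid are arbitrary illustrations, not a prescription.

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: many candidate features, few of them informative.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),           # centering/scaling learned from training folds only
    ("select", SelectKBest(f_classif)),    # feature selection repeated inside each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner cross-validation chooses the number of retained features.
inner = GridSearchCV(pipe, {"select__k": [5, 10, 20]}, cv=5)
# Outer cross-validation estimates performance of the whole tuned procedure.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested cross-validation accuracy: %.2f" % outer_scores.mean())
</syntaxhighlight>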