====Repeated random sub-sampling validation====
This method, also known as [[Monte Carlo method|Monte Carlo]] cross-validation,<ref>{{cite journal |last1=Xu |first1=Qing-Song |last2=Liang |first2=Yi-Zeng |title=Monte Carlo cross validation |journal=Chemometrics and Intelligent Laboratory Systems |date=April 2001 |volume=56 |issue=1 |pages=1–11 |doi=10.1016/S0169-7439(00)00122-2 }}</ref><ref name="mccv">{{cite book |doi=10.1007/978-0-387-47509-7_8 |chapter=Resampling Strategies for Model Assessment and Selection |title=Fundamentals of Data Mining in Genomics and Proteomics |date=2007 |last1=Simon |first1=Richard |pages=173–186 |isbn=978-0-387-47508-0 }}</ref> creates multiple random splits of the dataset into training and validation data.<ref>{{cite book |doi=10.1007/978-1-4614-6849-3 |title=Applied Predictive Modeling |date=2013 |last1=Kuhn |first1=Max |last2=Johnson |first2=Kjell |isbn=978-1-4614-6848-6 }}{{pn|date=November 2024}}</ref> For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over ''k''-fold cross-validation) is that the proportion of the training/validation split is not dependent on the number of iterations (i.e., the number of partitions). The disadvantage of this method is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap. This method also exhibits [[Monte Carlo method|Monte Carlo]] variation, meaning that the results will vary if the analysis is repeated with different random splits. As the number of random splits approaches infinity, the result of repeated random sub-sampling validation tends towards that of leave-''p''-out cross-validation.

In a stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e., the dependent variable in the regression) is equal in the training and testing sets. This is particularly useful if the responses are [[dichotomous]] with an unbalanced representation of the two response values in the data.

A method that applies repeated random sub-sampling is [[RANSAC]].<ref>{{cite report |last1=Cantzler |first1=H |title=Random Sample Consensus (RANSAC) |url=https://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/CANTZLER2/ransac.pdf }}{{self-published inline|date=November 2024}}</ref>
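The following is a minimal sketch of the procedure in Python, assuming the scikit-learn library; the dataset, model (ordinary least squares), number of splits, and 25% validation fraction are illustrative choices rather than part of the method itself.

<syntaxhighlight lang="python">
# Repeated random sub-sampling (Monte Carlo) cross-validation:
# draw many independent random train/validation splits, fit on the
# training part, score on the validation part, and average the scores.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

X, y = load_diabetes(return_X_y=True)

# 100 independent random splits; each validation set holds 25% of the data.
# Unlike k-fold, the validation fraction is chosen independently of the
# number of repetitions, and validation sets may overlap across splits.
splitter = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)

scores = []
for train_idx, val_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    scores.append(mean_squared_error(y[val_idx], predictions))

# The cross-validation estimate is the average over the random splits.
print(f"Mean validation MSE over {len(scores)} splits: {np.mean(scores):.2f}")
</syntaxhighlight>

For the stratified variant with a dichotomous response, the same loop can be driven by a splitter that preserves the class proportions in each random split (for example, scikit-learn's <code>StratifiedShuffleSplit</code>), so that the mean response is approximately equal in the training and validation sets.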