==Applications==

Cross-validation can be used to compare the performance of different predictive modeling procedures.  For example, suppose we are interested in [[optical character recognition]], and we are considering using either a [[Support Vector Machine|support vector machine]] (SVM) or [[K-nearest neighbors algorithm|''k''-nearest neighbors]] (KNN) to predict the true character from an image of a handwritten character.  Using cross-validation, we can obtain empirical estimates of each method's fraction of misclassified characters and compare the two methods on that basis. In contrast, an in-sample estimate, computed on the same data used to fit the model, does not represent the quantity of interest (i.e. the generalization error) and is typically optimistic.<ref>{{cite book |doi=10.1007/978-0-387-84858-7 |title=The Elements of Statistical Learning |series=Springer Series in Statistics |date=2009 |isbn=978-0-387-84857-0 }}{{pn|date=November 2024}}</ref>
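
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic two-dimensional points rather than character images, a nearest-centroid classifier stands in for the SVM to keep the example free of external libraries, and all function names (<code>k_fold_indices</code>, <code>cv_error</code>, and so on) are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean training point is closest
    (a simple stand-in for the SVM, which is NOT implemented here)."""
    best, best_d = None, float("inf")
    for label in sorted(set(train_y)):
        pts = [p for p, y in zip(train_X, train_y) if y == label]
        mean = [sum(c) / len(pts) for c in zip(*pts)]
        d = sum((a - b) ** 2 for a, b in zip(mean, x))
        if d < best_d:
            best, best_d = label, d
    return best

def error_rate(predict, train_X, train_y, test_X, test_y):
    """Fraction of test points misclassified by a model fit on the training set."""
    wrong = sum(predict(train_X, train_y, x) != y for x, y in zip(test_X, test_y))
    return wrong / len(test_X)

def cv_error(predict, X, y, k=5, seed=0):
    """Average misclassification fraction over k held-out folds."""
    errors = []
    for fold in k_fold_indices(len(X), k, seed):
        held = set(fold)
        tr = [i for i in range(len(X)) if i not in held]
        errors.append(error_rate(predict,
                                 [X[i] for i in tr], [y[i] for i in tr],
                                 [X[i] for i in fold], [y[i] for i in fold]))
    return sum(errors) / k

# Synthetic data (an assumption): two overlapping Gaussian classes, 60 points each.
rng = random.Random(42)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0, 1.5) for _ in range(60)]
y = [label for label in (0, 1) for _ in range(60)]

in_sample_1nn = error_rate(knn_predict, X, y, X, y)  # test set == training set
cv_1nn = cv_error(knn_predict, X, y)
cv_centroid = cv_error(centroid_predict, X, y)
print(in_sample_1nn, cv_1nn, cv_centroid)
```

With ''k'' = 1, every training point is its own nearest neighbor, so the in-sample error of the nearest-neighbor classifier is zero regardless of how well it generalizes. The two cross-validated fractions, by contrast, are computed only on held-out points and can be compared directly.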

Cross-validation can also be used in [[Feature selection|''variable selection'']].<ref name="Picard84">{{cite journal |last1=Picard |first1=Richard |last2=Cook |first2=Dennis |year=1984 |title=Cross-Validation of Regression Models |journal=Journal of the American Statistical Association |jstor=2288403 |volume=79 |pages=575–583 |doi=10.2307/2288403 |issue=387 }}</ref> Suppose we are using the [[gene expression|expression]] levels of 20 [[proteins]] to predict whether a [[cancer]] patient will respond to a [[drug]]. A practical goal would be to determine which subset of the 20 features should be used to produce the best predictive model. For most modeling procedures, if we compare feature subsets using the in-sample error rates, the best performance will occur when all 20 features are used. However, under cross-validation, the best-performing model will generally include only the subset of features that are deemed truly informative.
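
The contrast between in-sample and cross-validated subset comparison can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the example above: the outcome is continuous and modeled by ordinary least squares rather than a binary drug response, there are 6 synthetic features instead of 20 (so that all 63 non-empty subsets can be enumerated), and only the first two features carry signal. The helper names are hypothetical.

```python
import itertools
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(A, b)

def mse(beta, X, y):
    """Mean squared prediction error of coefficients beta on (X, y)."""
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)

def columns(X, subset):
    """Keep only the feature columns listed in subset."""
    return [[row[j] for j in subset] for row in X]

def cv_mse(X, y, k=5, seed=0):
    """k-fold cross-validated mean squared error of the least-squares fit."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for i in range(k):
        test = idx[i::k]
        held = set(test)
        train = [j for j in idx if j not in held]
        beta = fit_ols([X[j] for j in train], [y[j] for j in train])
        total += mse(beta, [X[j] for j in test], [y[j] for j in test]) * len(test)
    return total / len(X)

# Synthetic data (an assumption): only features 0 and 1 influence the outcome.
rng = random.Random(1)
n, p = 40, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

subsets = [s for r in range(1, p + 1) for s in itertools.combinations(range(p), r)]
in_sample = {s: mse(fit_ols(columns(X, s), y), columns(X, s), y) for s in subsets}
cross_val = {s: cv_mse(columns(X, s), y) for s in subsets}

best_in_sample = min(in_sample, key=in_sample.get)
best_cv = min(cross_val, key=cross_val.get)
print("best by in-sample error:", best_in_sample)
print("best by cross-validation:", best_cv)
```

For least squares, adding a feature can never increase the training error, so the in-sample criterion favors the full feature set. The cross-validated criterion penalizes features that only fit noise, so the subset it selects is typically smaller, although the exact choice depends on the data and on the fold assignment.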

A recent development in medical statistics is the use of cross-validation in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates.<ref>{{cite journal |last1=Willis |first1=Brian H. |last2=Riley |first2=Richard D. |title=Measuring the statistical validity of summary meta-analysis and meta-regression results for use in clinical practice |journal=Statistics in Medicine |date=20 September 2017 |volume=36 |issue=21 |pages=3283–3301 |doi=10.1002/sim.7372 |pmid=28620945 |pmc=5575530 }}</ref>  Cross-validation has also been used in a more conventional sense in meta-analysis, to estimate the likely prediction error of meta-analysis results.<ref>{{cite journal |last1=Riley |first1=Richard D. |last2=Ahmed |first2=Ikhlaaq |last3=Debray |first3=Thomas P. A. |last4=Willis |first4=Brian H. |last5=Noordzij |first5=J. Pieter |last6=Higgins |first6=Julian P.T. |last7=Deeks |first7=Jonathan J. |title=Summarising and validating test accuracy results across multiple studies for use in clinical practice |journal=Statistics in Medicine |date=15 June 2015 |volume=34 |issue=13 |pages=2081–2103 |doi=10.1002/sim.6471 |pmid=25800943 |pmc=4973708 }}</ref>