{{Short description|Flaw in mathematical modelling}}
{{Refimprove|date=August 2017}}
{{Machine learning}}
[[Image:Overfitting.svg|thumb|300px|Figure 1. The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new unseen data, illustrated by black-outlined dots, compared to the black line.]]
[[File:Pyplot overfitting.png|thumb|300x300px|Figure 2. Noisy (roughly linear) data is fitted to a linear function and a [[polynomial]] function. Although the polynomial function is a perfect fit, the linear function can be expected to generalize better: if the two functions were used to extrapolate beyond the fitted data, the linear function should make better predictions.]]
[[Image:Parabola_on_line.png|thumb|300px|Figure 3. The blue dashed line represents an underfitted model. A straight line can never fit a parabola. This model is too simple.]]

In mathematical modeling, '''overfitting''' is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably".<ref>Definition of "[https://web.archive.org/web/20171107014257/https://en.oxforddictionaries.com/definition/overfitting overfitting]" at [[OxfordDictionaries.com]]: this definition is specifically for statistics.</ref> An '''overfitted model''' is a [[mathematical model]] that contains more [[parameter]]s than can be justified by the data.<ref name=CDS/> In the special case where the model consists of a polynomial function, these parameters represent the [[degree of a polynomial]]. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the [[Statistical noise|noise]]) as if that variation represented underlying model structure.<ref name="BA2002" />{{rp|45}}

'''Underfitting''' occurs when a mathematical model cannot adequately capture the underlying structure of the data. An '''under-fitted model''' is a model in which some parameters or terms that would appear in a correctly specified model are missing.<ref name=CDS/> Underfitting would occur, for example, when fitting a linear model to nonlinear data. Such a model will tend to have poor predictive performance.

The possibility of overfitting exists because the criterion used for [[model selection|selecting the model]] is not the same as the criterion used to judge the suitability of a model. For example, a model might be selected by maximizing its performance on some set of [[training data]], and yet its suitability might be determined by its ability to perform well on unseen data; overfitting occurs when a model begins to "memorize" training data rather than "learning" to generalize from a trend. As an extreme example, if the number of parameters is the same as or greater than the number of observations, then a model can perfectly predict the training data simply by memorizing the data in its entirety. (For an illustration, see Figure 2.) Such a model, though, will typically fail severely when making predictions.
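This extreme case can be reproduced with a short numerical experiment. The sketch below is an illustration added here rather than something drawn from the cited sources; it assumes only the NumPy library, and the data, noise level, and polynomial degrees are arbitrary choices. It fits noisy, roughly linear data with a straight line and with a degree-9 polynomial; the degree-9 fit has ten coefficients for ten training points, so it can reproduce the training data essentially exactly, yet it typically does worse on fresh data from the same process.

<syntaxhighlight lang="python">
# Minimal, self-contained sketch of overfitting (illustrative only).
# Roughly linear data with noise is fitted by a straight line and by a
# degree-9 polynomial; the latter has as many coefficients (10) as there
# are training points, so it can reproduce the training data exactly.
import numpy as np

rng = np.random.default_rng(seed=0)

x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)

# Fresh data from the same underlying process, not used for fitting.
x_test = np.linspace(0.0, 1.0, 200)
y_test = 2.0 * x_test + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
</syntaxhighlight>

The high-degree fit drives its training error to nearly zero, while its error on the fresh sample is typically the larger of the two, which is the pattern described above and pictured in Figure 2.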
Overfitting is directly related to the approximation error of the selected function class and the optimization error of the optimization procedure. A function class that is too large, in a suitable sense, relative to the dataset size is likely to overfit.<ref>{{Citation |last1=Bottou |first1=Léon |last2=Bousquet |first2=Olivier |title=The Tradeoffs of Large-Scale Learning |date=2011-09-30 |url=http://dx.doi.org/10.7551/mitpress/8996.003.0015 |work=Optimization for Machine Learning |pages=351–368 |access-date=2023-12-08 |publisher=The MIT Press |isbn=978-0-262-29877-3 |doi=10.7551/mitpress/8996.003.0015}}</ref> Even when the fitted model does not have an excessive number of parameters, it is to be expected that the fitted relationship will appear to perform less well on a new dataset than on the dataset used for fitting (a phenomenon sometimes known as ''shrinkage'').<ref name="CDS">Everitt B.S., Skrondal A. (2010), ''Cambridge Dictionary of Statistics'', [[Cambridge University Press]].</ref> In particular, the value of the [[coefficient of determination]] will [[Shrinkage (statistics)|shrink]] relative to the original data.

To lessen the chance or amount of overfitting, several techniques are available (e.g., [[Model selection|model comparison]], [[cross-validation (statistics)|cross-validation]], [[regularization (mathematics)|regularization]], [[early stopping]], [[pruning (algorithm)|pruning]], [[Prior distribution|Bayesian priors]], or [[Dropout (neural networks)|dropout]]). The basis of some techniques is to either (1) explicitly penalize overly complex models or (2) test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
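As a purely illustrative example of these two ideas, the sketch below combines an explicit complexity penalty (L2, or ridge, regularization) with 5-fold cross-validation on data held out from fitting. It is not taken from the cited sources; it assumes the NumPy and scikit-learn libraries, and the data, polynomial degree, and penalty strengths are arbitrary choices.

<syntaxhighlight lang="python">
# Illustrative sketch (not from the cited sources): combining a complexity
# penalty (ridge / L2 regularization) with k-fold cross-validation.
# Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 1.0, size=30).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(scale=0.2, size=x.shape[0])

# alpha controls the strength of the penalty on large coefficients;
# a near-zero alpha is close to an unpenalized, overfitting-prone fit.
for alpha in (1e-6, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=alpha))
    # 5-fold cross-validation: each fold is scored on data not used to fit it,
    # approximating performance on typical unseen data.
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:g}: cross-validated MSE {-scores.mean():.4f}")
</syntaxhighlight>

Here the penalized model is typically the one with the lower cross-validated error: the penalty discourages overly complex fits, and scoring on held-out folds exposes a model that has merely memorized its training data.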