==Machine learning==
[[Image:Overfitting svg.svg|thumb|300px|Figure 4. Overfitting/overtraining in supervised learning (e.g., a [[Artificial neural network|neural network]]). Training error is shown in blue, and validation error in red, both as a function of the number of training cycles. If the validation error increases (positive slope) while the training error steadily decreases (negative slope), then a situation of overfitting may have occurred. The best predictive and fitted model would be where the validation error has its global minimum.]]

Usually, a learning [[algorithm]] is trained using some set of "training data": exemplary situations for which the desired output is known. The goal is that the algorithm will also perform well on predicting the output when fed "validation data" that was not encountered during its training.

Overfitting is the use of models or procedures that violate [[Occam's razor]], for example by including more adjustable parameters than are ultimately optimal, or by using a more complicated approach than is ultimately optimal. For an example where there are too many adjustable parameters, consider a dataset where training data for {{mvar|y}} can be adequately predicted by a linear function of two independent variables. Such a function requires only three parameters (the intercept and two slopes). Replacing this simple function with a new, more complex quadratic function, or with a new, more complex linear function on more than two independent variables, carries a risk: Occam's razor implies that any given complex function is ''a priori'' less probable than any given simple function. If the new, more complicated function is selected instead of the simple function, and if there was not a large enough gain in training-data fit to offset the complexity increase, then the new complex function "overfits" the data. The complex overfitted function will likely perform worse than the simpler function on validation data outside the training dataset, even though it performed as well, or perhaps even better, on the training dataset.<ref name=hawkins>{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | journal = [[Journal of Chemical Information and Modeling]] | volume = 44 | issue = 1 | pages = 1–12 | doi = 10.1021/ci0342472 | pmid = 14741005 | s2cid = 12440383 }}</ref>

When comparing different types of models, complexity cannot be measured solely by counting how many parameters exist in each model; the expressivity of each parameter must be considered as well. For example, it is nontrivial to directly compare the complexity of a neural net (which can track curvilinear relationships) with {{mvar|m}} parameters to a regression model with {{mvar|n}} parameters.<ref name=hawkins />

Overfitting is especially likely in cases where learning was performed too long or where training examples are rare, causing the learner to adjust to very specific random features of the training data that have no [[causal relation]] to the [[Function approximation|target function]]. In this process of overfitting, the performance on the training examples still increases while the performance on unseen data becomes worse.
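The following minimal sketch is illustrative only: the synthetic data, the noise level, and the use of NumPy's <code>polyfit</code> are assumptions chosen for the example, not taken from the sources cited in this section. It fits both a straight line and a degree-9 polynomial to a small noisy sample drawn from a linear relationship; the more flexible model typically attains a lower training error but a higher validation error, which is the pattern described above.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn from a noisy linear relationship: y = 2x + 1 + noise.
x_train = np.linspace(0.0, 1.0, 12)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)
x_val = np.linspace(0.0, 1.0, 200)
y_val = 2 * x_val + 1 + rng.normal(scale=0.2, size=x_val.size)

for degree in (1, 9):
    # Degree 9 gives ten adjustable parameters, enough to chase the noise
    # in only twelve training points.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # The degree-9 fit typically reports a much lower training MSE but a
    # higher validation MSE than the simple linear fit.
    print(f"degree {degree}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
</syntaxhighlight>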
As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It is easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes, but this model will not generalize at all to new data, because those past times will never occur again. Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight).

One can intuitively understand overfitting from the fact that information from all past experience can be divided into two groups: information that is relevant for the future, and irrelevant information ("noise"). Everything else being equal, the more difficult a criterion is to predict (i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore. A learning algorithm that can reduce the risk of fitting noise is called "[[Robustness (computer science)#Robust machine learning|robust]]".
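The retail-purchase example above can be made concrete with a toy sketch; the timestamps and items below are hypothetical and serve only to illustrate the hindsight/foresight distinction. A model that simply memorizes the training records keyed by purchase time fits the training set perfectly but predicts nothing for new data, because an exact timestamp never recurs.

<syntaxhighlight lang="python">
from datetime import datetime

# Toy training set: each record maps a purchase timestamp to the item bought.
training_data = {
    datetime(2023, 5, 1, 9, 15): "coffee",
    datetime(2023, 5, 1, 12, 40): "sandwich",
    datetime(2023, 5, 2, 18, 5): "milk",
}

def memorizing_model(timestamp):
    # Perfect hindsight: every training timestamp maps to the right item.
    # No foresight: a timestamp never seen before yields no prediction.
    return training_data.get(timestamp, "no prediction")

print(memorizing_model(datetime(2023, 5, 1, 9, 15)))  # "coffee" (training point)
print(memorizing_model(datetime(2023, 5, 3, 10, 0)))  # "no prediction" (new data)
</syntaxhighlight>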
===Consequences===
{{multiple image
 | total_width = 300
 | footer = Overfitted [[generative model]]s may produce outputs that are virtually identical to instances from their training set.<ref name=earthquake/>
 | image1 = Anne Graham Lotz (October 2008).jpg
 | alt1 =
 | caption1 = A photograph of [[Anne Graham Lotz]] included in the training set of [[Stable Diffusion]], a [[text-to-image model]]
 | image2 = Ann graham lotz stable diffusion.webp
 | alt2 =
 | caption2 = An image generated by Stable Diffusion using the prompt "Anne Graham Lotz"
}}
The most obvious consequence of overfitting is poor performance on the validation dataset. Other negative consequences include:
* A function that is overfitted is likely to request more information about each item in the validation dataset than does the optimal function; gathering this additional unneeded data can be expensive or error-prone, especially if each individual piece of information must be gathered by human observation and manual data entry.<ref name=hawkins />
* A more complex, overfitted function is likely to be less portable than a simple one. At one extreme, a one-variable linear regression is so portable that, if necessary, it could even be done by hand. At the other extreme are models that can be reproduced only by exactly duplicating the original modeler's entire setup, making reuse or scientific reproduction difficult.<ref name=hawkins />
* It may be possible to reconstruct details of individual training instances from an overfitted machine learning model's training set. This may be undesirable if, for example, the training data includes sensitive [[personally identifiable information]] (PII). This phenomenon also presents problems in the area of [[artificial intelligence and copyright]], with the developers of some generative deep learning models such as [[Stable Diffusion]] and [[GitHub Copilot]] being sued for copyright infringement because these models have been found to be capable of reproducing certain copyrighted items from their training data.<ref name=earthquake>{{cite web |work=Ars Technica |last=Lee |first=Timothy B. |date=3 April 2023 |title=Stable Diffusion copyright lawsuits could be a legal earthquake for AI |url=https://arstechnica.com/tech-policy/2023/04/stable-diffusion-copyright-lawsuits-could-be-a-legal-earthquake-for-ai/ }}</ref><ref name="Verge copilot">{{Cite web |last=Vincent |first=James |date=2022-11-08 |title=The lawsuit that could rewrite the rules of AI copyright |url=https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data |access-date=2022-12-07 |website=The Verge |language=en-US}}</ref>

===Remedy===
The optimal function usually needs verification on bigger or completely new datasets. There are, however, methods such as the [[minimum spanning tree]] or the [[life-time of correlation]] that exploit the dependence between correlation coefficients and the time-series window width. Whenever the window width is big enough, the correlation coefficients are stable and no longer depend on the window width. Therefore, a correlation matrix can be created by calculating a coefficient of correlation between investigated variables. This matrix can be represented topologically as a complex network, where direct and indirect influences between variables are visualized. Dropout regularisation (probabilistically removing inputs to a layer during training) can also improve robustness and therefore reduce overfitting.
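A minimal sketch of the dropout idea mentioned above, written in plain NumPy rather than any particular library's API; the layer shape and keep probability are arbitrary choices for illustration. During training, each unit is kept with probability <code>keep_prob</code> and the surviving activations are rescaled so their expected value is unchanged ("inverted" dropout), while the full layer is used at evaluation time.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.8, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob during
    training and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return activations  # the full layer is used at evaluation time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

layer_output = rng.normal(size=(4, 8))        # activations of one hidden layer
print(dropout(layer_output))                  # training pass: some units zeroed
print(dropout(layer_output, training=False))  # evaluation pass: unchanged
</syntaxhighlight>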