===Results validation===
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.75|An example of data produced by [[data dredging]] through a bot operated by statistician Tyler Vigen, apparently showing a close link between the number of letters in the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders]]
Data mining can be unintentionally misused, producing results that appear significant but do not actually predict future behavior and cannot be [[Reproducibility|reproduced]] on a new sample of data, and are therefore of little use. This is sometimes caused by investigating too many hypotheses without performing proper [[statistical hypothesis testing]]. A simple version of this problem in [[machine learning]] is known as [[overfitting]], but the same problem can arise at different phases of the process, so a train/test split, when applicable at all, may not be sufficient to prevent it.<ref name="hawkins">{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 | s2cid = 12440383 }}</ref>
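
The effect of testing too many hypotheses can be illustrated with a small simulation. The following is a minimal sketch, not a standard procedure: the function names <code>pearson</code> and <code>dredge</code>, the sample sizes, and the use of Gaussian noise are all assumptions chosen for illustration. Many unrelated random series are compared against a random target, the one with the strongest correlation is selected, and the same relationship is then measured on fresh data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_candidates=1000, n_points=10, n_new=1000, seed=0):
    """Return (in-sample |r| of the best candidate, |r| on fresh data).

    Every series is independent Gaussian noise (an illustrative assumption),
    so any correlation that is found exists purely by chance.
    """
    rng = random.Random(seed)

    def noise(k):
        return [rng.gauss(0.0, 1.0) for _ in range(k)]

    target = noise(n_points)
    candidates = [noise(n_points) for _ in range(n_candidates)]

    # "Investigate" every hypothesis and keep only the most impressive one.
    best = max(candidates, key=lambda c: abs(pearson(c, target)))
    in_sample = abs(pearson(best, target))

    # New observations of the same two noise processes: the apparent
    # relationship has no reason to persist.
    out_of_sample = abs(pearson(noise(n_new), noise(n_new)))
    return in_sample, out_of_sample


if __name__ == "__main__":
    in_r, out_r = dredge()
    print(f"best in-sample |r|: {in_r:.2f}")
    print(f"|r| on new data:    {out_r:.2f}")
```

With these assumed settings, the selected series typically shows a strong in-sample correlation although none of the series is related to the target, while the correlation on new data stays close to zero.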

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set that are not present in the general data set. This is called [[overfitting]]. To overcome this, the evaluation uses a [[test set]] of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a [[training set]] of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which the algorithm had ''not'' been trained. The accuracy of the patterns can then be measured as the proportion of e-mails they classify correctly. Several statistical methods may be used to evaluate the algorithm, such as [[Receiver operating characteristic|ROC curves]].
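
The evaluation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the sample messages, the helper names (<code>split</code>, <code>train</code>, <code>classify</code>, <code>accuracy</code>) and the word-count scoring rule are all hypothetical and stand in for whatever data mining algorithm is actually used.

```python
import random
from collections import Counter

# Hypothetical labelled e-mails, invented purely for illustration.
EMAILS = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("claim your free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon tomorrow", "legitimate"),
    ("project report attached", "legitimate"),
    ("lunch tomorrow at noon", "legitimate"),
    ("please review the report", "legitimate"),
]


def split(data, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction of it as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def train(examples):
    """Learn "patterns": how often each word occurs under each label."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def classify(model, text):
    """Label a message by which class its words were seen in more often."""
    words = text.lower().split()
    spam_score = sum(model["spam"][w] for w in words)
    legit_score = sum(model["legitimate"][w] for w in words)
    return "spam" if spam_score > legit_score else "legitimate"


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    training_set, test_set = split(EMAILS)
    model = train(training_set)  # the test set is never seen in training
    print("training accuracy:", accuracy(model, training_set))
    print("test accuracy:    ", accuracy(model, test_set))
```

A training accuracy that is much higher than the test accuracy suggests that the learned patterns are overfitted to the training set.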

If the learned patterns do not meet the desired standards, the pre-processing and data mining steps must be re-evaluated and changed. If they do meet the desired standards, the remaining task is to interpret the learned patterns and turn them into knowledge.