Data mining
===Results validation===
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.75|An example of data produced by [[data dredging]] through a bot operated by statistician Tyler Vigen, apparently showing a close link between the number of letters in the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders]]
Data mining can be unintentionally misused, producing results that appear to be significant but which do not actually predict future behavior and cannot be [[Reproducibility|reproduced]] on a new sample of data, and which are therefore of little use. This is sometimes caused by investigating too many hypotheses without performing proper [[statistical hypothesis testing]]. A simple version of this problem in [[machine learning]] is known as [[overfitting]], but the same problem can arise at different phases of the process, so a train/test split (when applicable at all) may not be sufficient to prevent it.<ref name="hawkins">{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 | s2cid = 12440383 }}</ref>

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the algorithms are necessarily valid: it is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a [[test set]] of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output.
For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a [[training set]] of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had ''not'' been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as [[Receiver operating characteristic|ROC curves]]. If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
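
The spam example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the messages are a hypothetical toy corpus invented for the sketch, the classifier is a simple word-count model with add-one smoothing chosen only for brevity, and all function names (<code>train</code>, <code>spam_score</code>, <code>accuracy</code>, <code>roc_auc</code>) are illustrative. The patterns are learned from the training set only, and both accuracy and the area under the ROC curve are then measured on held-out messages.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
SPAM = [
    "win free money now",
    "free prize claim now",
    "cheap pills free offer",
    "claim your free money prize",
    "win cheap offer now",
    "free money free prize",
]
HAM = [
    "meeting agenda for monday",
    "please review the report",
    "lunch on monday with the team",
    "the report is attached",
    "agenda for the team meeting",
    "please send the meeting notes",
]


def train(examples):
    """Count word frequencies per class (label 1 = spam, 0 = legitimate)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def spam_score(counts, text):
    """Log-likelihood ratio with add-one smoothing; higher means more spam-like."""
    vocab_size = len(set(counts[0]) | set(counts[1])) + 1  # +1 slot for unseen words
    spam_total = sum(counts[1].values()) + vocab_size
    ham_total = sum(counts[0].values()) + vocab_size
    score = 0.0
    for word in text.split():
        p_spam = (counts[1][word] + 1) / spam_total
        p_ham = (counts[0][word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


def accuracy(counts, examples):
    """Fraction of messages classified correctly, predicting spam when score > 0."""
    correct = sum((spam_score(counts, text) > 0) == bool(label)
                  for text, label in examples)
    return correct / len(examples)


def roc_auc(counts, examples):
    """Area under the ROC curve: the probability that a randomly chosen spam
    message scores higher than a randomly chosen legitimate one (ties count 0.5)."""
    pos = [spam_score(counts, t) for t, label in examples if label == 1]
    neg = [spam_score(counts, t) for t, label in examples if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Fixed split: the last two messages of each class are held out as the test set.
train_set = [(t, 1) for t in SPAM[:-2]] + [(t, 0) for t in HAM[:-2]]
test_set = [(t, 1) for t in SPAM[-2:]] + [(t, 0) for t in HAM[-2:]]

model = train(train_set)                  # patterns are learned from training data only
test_accuracy = accuracy(model, test_set)  # evaluated on messages the model never saw
test_auc = roc_auc(model, test_set)
print(f"test accuracy = {test_accuracy:.2f}, test ROC AUC = {test_auc:.2f}")
```

Because the toy corpus is small and cleanly separable, the held-out scores here are not representative of real e-mail; the point of the sketch is only that the evaluation never touches the messages used for training.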