==Process==
The ''knowledge discovery in databases (KDD) process'' is commonly defined with the stages:
# Selection
# Pre-processing
# Transformation
# ''Data mining''
# Interpretation/evaluation.<ref name="Fayyad" />

It exists, however, in many variations on this theme, such as the [[Cross-industry standard process for data mining]] (CRISP-DM), which defines six phases:
# Business understanding
# Data understanding
# Data preparation
# Modeling
# Evaluation
# [[System deployment|Deployment]]

or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners.<ref name=KDN_1>{{cite web| title=What main methodology are you using for data mining (2002)?| url=https://www.kdnuggets.com/polls/2002/methodology.htm| publisher=[[KDnuggets]]| date=2002| access-date=29 December 2023| url-status=live| archive-date=16 January 2017| archive-url=https://web.archive.org/web/20170116195014/http://www.kdnuggets.com/polls/2002/methodology.htm}}</ref><ref name=KDN_2>{{cite web| title=What main methodology are you using for data mining (2004)?| url=https://www.kdnuggets.com/polls/2004/data_mining_methodology.htm| publisher=[[KDnuggets]]| date=2004| access-date=29 December 2023| url-status=live| archive-date=8 February 2017| archive-url=https://web.archive.org/web/20170208085109/http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm}}</ref><ref name=KDN_3>{{cite web| title=What main methodology are you using for data mining (2007)?| url=http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm| publisher=[[KDnuggets]]| date=2007| access-date=29 December 2023| url-status=live| archive-date=17 November 2012| archive-url=https://web.archive.org/web/20121117003400/http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm}}</ref><ref name=KDN_4>{{cite web| title=What main methodology are you using for data mining (2014)?| url=https://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html| publisher=[[KDnuggets]]| date=2014| access-date=29 December 2023| url-status=live| archive-date=1 August 2016| archive-url=https://web.archive.org/web/20160801220617/http://kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html}}</ref> The only other data mining standard named in these polls was [[SEMMA]]. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,<ref name="kurgan">Lukasz Kurgan and Petr Musilek: [http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=451120 "A survey of Knowledge Discovery and Data Mining process models"] {{Webarchive|url=https://web.archive.org/web/20130526234755/http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=451120 |date=2013-05-26 }}. ''The Knowledge Engineering Review''. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, {{doi|10.1017/S0269888906000737}}</ref> and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.<ref name="AzevedoSantos">Azevedo, A. and Santos, M. F. [http://www.iadis.net/dl/final_uploads/200812P033.pdf KDD, SEMMA and CRISP-DM: a parallel overview] {{webarchive|url=https://web.archive.org/web/20130109114939/http://www.iadis.net/dl/final_uploads/200812P033.pdf |date=2013-01-09 }}. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.</ref>

===Pre-processing===
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a [[data mart]] or [[data warehouse]].
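For illustration, the assembly and cleaning of a target set can be sketched in Python. The records below, the median-based noise rule, and the cut-off of five median absolute deviations are hypothetical choices for the sketch, not part of any standard:

```python
import statistics

# Hypothetical target data set: (customer_id, monthly_spend) records.
# None marks missing data; 9999.0 is an implausible, noisy outlier.
records = [
    ("c1", 120.0), ("c2", 95.5), ("c3", None),
    ("c4", 110.0), ("c5", 9999.0), ("c6", 101.3),
]

# Drop observations with missing data.
complete = [(cid, v) for cid, v in records if v is not None]

# Drop noisy observations: values far from the median, measured in
# median absolute deviations (a robust, if crude, rule of thumb).
values = [v for _, v in complete]
center = statistics.median(values)
mad = statistics.median(abs(v - center) for v in values)
cleaned = [(cid, v) for cid, v in complete if abs(v - center) <= 5 * mad]

print(cleaned)  # the incomplete and noisy records are gone
```

A median-based rule is used here rather than a mean-based one because a single extreme outlier can inflate the mean and standard deviation enough to hide itself.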
Pre-processing is essential to analyze the [[Multivariate statistics|multivariate]] data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing [[statistical noise|noise]] and those with [[missing data]].

===Data mining===
Data mining involves six common classes of tasks:<ref name="Fayyad">{{cite web |last1=Fayyad |first1=Usama |author-link1=Usama Fayyad |last2=Piatetsky-Shapiro |first2=Gregory|author-link2=Gregory Piatetsky-Shapiro |last3=Smyth |first3=Padhraic |title=From Data Mining to Knowledge Discovery in Databases |year=1996 |url=http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf |archive-date=2022-10-09 |url-status=live |access-date = 17 December 2008 }}</ref>
* [[Anomaly detection]] (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation because they fall outside the standard range.
* [[Association rule learning]] (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
* [[Cluster analysis|Clustering]] – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
* [[Statistical classification|Classification]] – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
* [[Regression analysis|Regression]] – Attempts to find a function that models the data with the least error; that is, to estimate the relationships among data or datasets.
* [[Automatic summarization|Summarization]] – Provides a more compact representation of the data set, including visualization and report generation.

===Results validation===
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.75|An example of data produced by [[data dredging]] through a bot operated by statistician Tyler Vigen, apparently showing a close link between the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders]]
Data mining can unintentionally be misused, producing results that appear to be significant but which do not actually predict future behavior and cannot be [[Reproducibility|reproduced]] on a new sample of data, and therefore bear little use. This is sometimes caused by investigating too many hypotheses without proper [[statistical hypothesis testing]]. A simple version of this problem in [[machine learning]] is known as [[overfitting]], but the same problem can arise at different phases of the process, so a train/test split – when applicable at all – may not be sufficient to prevent it.<ref name="hawkins">{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 | s2cid = 12440383 }}</ref>

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called [[overfitting]].
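Overfitting can be made concrete with a small sketch: a model that merely memorizes its training set classifies it perfectly, yet its "patterns" fail on held-out data. The numbers, the noisy label, and both toy models below are hypothetical:

```python
# Each example: (feature, label). The underlying rule is label = feature > 5,
# but one training label, (4, 1), is noise (mislabeled).
train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (4, 1)]
test  = [(2, 0), (4, 0), (6, 1), (9, 1)]

# "Memorizer": a lookup table of the training set -- the extreme case of
# finding patterns that exist only in the training data.
table = dict(train)
def memorize(x):
    return table.get(x, 0)          # unseen inputs default to 0

# A simpler learned rule: threshold at 5 (captures the real structure).
def threshold(x):
    return 1 if x > 5 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorize, train))    # 1.0 -- perfect on training data
print(accuracy(memorize, test))     # 0.25 -- the memorized "patterns" fail
print(accuracy(threshold, test))    # 1.0 -- the general rule transfers
```

The memorizer even reproduces the noisy label, which is precisely the pattern-in-the-training-set-only behavior described above.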
To overcome this, the evaluation uses a [[test set]] of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a [[training set]] of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had ''not'' been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as [[Receiver operating characteristic|ROC curves]]. If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
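As a sketch of this evaluation step, suppose a trained spam filter assigns each held-out e-mail a score in [0, 1]; the scores and the 0.5 decision threshold below are hypothetical. Accuracy and one (TPR, FPR) point of an ROC curve can then be computed from the confusion counts:

```python
# Hypothetical model scores on a held-out test set: (score, true_label),
# where 1 = spam. The model was not trained on these e-mails.
test_scores = [(0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.40, 0), (0.10, 0)]

def confusion(threshold, scored):
    """Count true/false positives and negatives at a score threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    tn = sum(1 for s, y in scored if s < threshold and y == 0)
    return tp, fp, fn, tn

# Accuracy at a 0.5 threshold: the fraction of e-mails classified correctly.
tp, fp, fn, tn = confusion(0.5, test_scores)
accuracy = (tp + tn) / len(test_scores)   # 5 of 6 correct

# One point of the ROC curve at the same threshold:
tpr = tp / (tp + fn)   # true-positive rate (recall)
fpr = fp / (fp + tn)   # false-positive rate
print(accuracy, tpr, fpr)
```

Sweeping the threshold from 1 down to 0 and plotting (FPR, TPR) at each step traces out the full ROC curve.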