{{short description|Process of extracting and discovering patterns in large data sets}}
{{redirect|Web mining|web browser-based cryptocurrency mining|cryptocurrency}}
{{Machine learning bar}}
'''Data mining''' is the process of extracting and discovering patterns in large [[data set]]s involving methods at the intersection of [[machine learning]], [[statistics]], and [[database system]]s.<ref name="acm" /> Data mining is an [[interdisciplinary]] subfield of [[computer science]] and [[statistics]] with an overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use.<ref name="acm">{{cite web |url=http://www.kdd.org/curriculum/index.html |title=Data Mining Curriculum |publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |access-date=2014-01-27 |archive-date=2013-10-14 |archive-url=https://web.archive.org/web/20131014213033/http://www.kdd.org/curriculum/index.html |url-status=live }}</ref><ref name="brittanica">{{cite web |last=Clifton |first=Christopher |title=Encyclopædia Britannica: Definition of Data Mining |year=2010 |url=https://www.britannica.com/EBchecked/topic/1056150/data-mining |access-date=2010-12-09 |archive-date=2011-02-05 |archive-url=https://web.archive.org/web/20110205121520/http://www.britannica.com/EBchecked/topic/1056150/data-mining |url-status=live }}</ref><ref name="elements">{{cite web|last1=Hastie|first1=Trevor|author-link1=Trevor Hastie|last2=Tibshirani|first2=Robert|author-link2=Robert Tibshirani|last3=Friedman|first3=Jerome|author-link3=Jerome H. Friedman|title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction|year=2009|url=http://www-stat.stanford.edu/~tibs/ElemStatLearn/|access-date=2012-08-07|archive-url=https://web.archive.org/web/20091110212529/http://www-stat.stanford.edu/~tibs/ElemStatLearn/|archive-date=2009-11-10|url-status=dead}}</ref><ref>{{cite book|last1=Han|first1=Jiawei|title=Data Mining: Concepts and Techniques|last2=Kamber|first2=Micheline|last3=Pei|first3=Jian|date=2011|publisher=Morgan Kaufmann|isbn=978-0-12-381479-1|edition=3rd|author-link=Jiawei Han}}</ref> Data mining is the analysis step of the "[[Knowledge discovery|knowledge discovery in databases]]" process, or KDD.<ref name="Fayyad" /> Aside from the raw analysis step, it also involves database and [[data management]] aspects, [[data pre-processing]], [[statistical model|model]] and [[Statistical inference|inference]] considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing of discovered structures, [[Data and information visualization|visualization]], and [[Online algorithm|online updating]].<ref name="acm" />

The term "data mining" is a [[misnomer]] because the goal is the extraction of [[pattern]]s and knowledge from large amounts of data, not the [[data scraping|extraction (''mining'') of data itself]].<ref name="han-kamber">{{cite book|title=Data mining: concepts and techniques|last1=Han|first1=Jiawei|last2=Kamber|first2=Micheline|date=2001|publisher=[[Morgan Kaufmann]]|isbn=978-1-55860-489-6|page=5|quote=Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long|author-link1=Jiawei Han}}</ref> It is also a [[buzzword]]<ref>[http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf OKAIRP 2005 Fall Conference, Arizona State University] {{Webarchive|url=https://web.archive.org/web/20140201170452/http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf|date=2014-02-01}}</ref> and is frequently applied to any form of large-scale data or [[Data processing|information processing]] ([[Data collection|collection]], [[information extraction|extraction]], [[Data warehouse|warehousing]], analysis, and statistics) as well as any application of [[Decision support system|computer decision support systems]], including [[artificial intelligence]] (e.g., machine learning) and [[business intelligence]]. Often the more general terms (''large scale'') ''[[data analysis]]'' and ''[[analytics]]'', or, when referring to actual methods, ''artificial intelligence'' and ''machine learning'', are more appropriate.

The actual data mining task is the semi-[[wikt:automatic|automatic]] or automatic analysis of massive quantities of data to extract previously unknown, interesting patterns such as groups of data records ([[cluster analysis]]), unusual records ([[anomaly detection]]), and [[Dependency (computer science)|dependencies]] ([[association rule mining]], [[sequential pattern mining]]). This usually involves using database techniques such as [[spatial index|spatial indices]]. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and [[predictive analytics]]. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a [[decision support system]]. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, although they do belong to the overall KDD process as additional steps.
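
As a hedged illustration of one of the pattern types named above, the following minimal Python sketch mines pairwise association rules from a handful of transactions. The toy transactions, the function name pair_rules, and the thresholds min_support and min_confidence are assumptions made for this example; they do not come from the cited sources, and production association rule miners (for example Apriori-style algorithms) handle larger itemsets and far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, made up purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    Only rules between single items are considered, which keeps the
    sketch short; the thresholds are arbitrary illustrative values.
    """
    n = len(transactions)
    # How many transactions contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions with both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

The resulting rules act as the kind of summary of the input data described above: a later step, such as a recommender or a decision support system, could consume them without revisiting the raw transactions.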

The difference between [[data analysis]] and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a [[marketing campaign]], regardless of the amount of data. In contrast, data mining uses machine learning and statistical models to uncover hidden patterns in a large volume of data.<ref>Olson, D. L. (2007). Data mining in business services. ''Service Business'', ''1''(3), 181–193. {{doi|10.1007/s11628-006-0014-7}}</ref>

The related terms ''[[data dredging]]'', ''data fishing'', and ''[[data snooping]]'' refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
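
The risk described above can be sketched with a small simulation in Python. Everything in it is an assumption chosen for illustration: the sample sizes, the number of candidate features, the random seed, and the helper names pearson and draw. All features are pure noise, so any strong correlation found in the small sample is spurious, and re-testing the selected "pattern" on a much larger sample plays the role of testing a new hypothesis against the larger population.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_small, n_large, n_features = 10, 2000, 200  # illustrative sizes only


def draw(n):
    """Draw n observations of a target and n_features unrelated noise features."""
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]
    return target, features


# Dredging step: search a small sample for the feature that best "predicts" the target.
small_target, small_features = draw(n_small)
best = max(
    range(n_features),
    key=lambda j: abs(pearson(small_features[j], small_target)),
)
r_small = pearson(small_features[best], small_target)

# Validation step: test the same feature on a fresh, much larger sample.
large_target, large_features = draw(n_large)
r_large = pearson(large_features[best], large_target)

print(f"feature {best}: r on small sample = {r_small:.2f}, "
      f"r on large sample = {r_large:.2f}")
```

With so many candidate features and so few observations, the best correlation in the small sample is expected to look strong even though no real relationship exists, while the correlation on the larger sample should be close to zero.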