{{Short description|Misuse of data analysis}}
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.3|A humorous example of a result produced by data dredging, showing a correlation between the number of letters in [[Scripps National Spelling Bee]]'s winning word and the number of people in the United States killed by [[venomous spiders]]]]

'''Data dredging''' (also known as '''data snooping''' or '''''p''-hacking''')<ref name="Wasserstein2016">{{cite journal | last1=Wasserstein | first1=Ronald L. | last2=Lazar | first2=Nicole A. | title=The ASA Statement on p-Values: Context, Process, and Purpose | journal=The American Statistician | publisher=Informa UK Limited | volume=70 | issue=2 | date=2016-04-02 | issn=0003-1305 | doi=10.1080/00031305.2016.1154108 | pages=129–133 | doi-access=free }}</ref>{{efn|Other names include data grubbing,<!--per Smith 2014 and others--> data butchery, data fishing, selective inference, significance chasing, and significance questing.}} is the misuse of [[data analysis]] to find patterns in data that can be presented as [[statistically significant]], thus dramatically increasing the risk of [[false positives]] while understating it. This is done by performing many [[statistical test]]s on the data and reporting only those that come back with significant results.<ref name="bmj02" /> Data dredging is thus often a misuse or misapplication of [[data mining]].

The process of data dredging involves testing multiple hypotheses using a single [[data set]] by [[Brute-force search|exhaustively searching]] for combinations of variables that might show a [[correlation]], or for groups of cases or observations that show differences in their mean or in their breakdown by some other variable. Conventional tests of [[statistical significance]] are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of [[Type I error|mistaken conclusions of a certain type]] (mistaken rejections of the [[null hypothesis]]). This level of risk is called the [[statistical significance|''significance level'']]. When large numbers of tests are performed, some produce false results of this type by chance alone: 5% of randomly chosen hypotheses might be (erroneously) reported as statistically significant at the 5% significance level, 1% at the 1% level, and so on. When enough hypotheses are tested, it is virtually certain that some will be reported as statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some [[spurious correlation|spurious correlations]]. If they are not cautious, researchers using data mining techniques can easily be misled by these results.

The term ''p-hacking'' (in reference to [[p-value|''p''-values]]) was coined in a 2014 paper by the three researchers behind the blog [[Data Colada]], which focuses on uncovering such problems in social-science research.<ref name=":22">{{Cite magazine |last=Lewis-Kraus |first=Gideon |date=2023-09-30 |title=They Studied Dishonesty. Was Their Work a Lie? |language=en-US |magazine=The New Yorker |url=https://www.newyorker.com/magazine/2023/10/09/they-studied-dishonesty-was-their-work-a-lie |access-date=2023-10-01 |issn=0028-792X}}</ref><ref name=":3">{{Cite web |last=Subbaraman |first=Nidhi |date=2023-09-24 |title=The Band of Debunkers Busting Bad Scientists |url=https://www.wsj.com/science/data-colada-debunk-stanford-president-research-14664f3 |url-status=live |archive-url=https://archive.today/20230924094046/https://www.wsj.com/science/data-colada-debunk-stanford-president-research-14664f3 |archive-date=2023-09-24 |access-date=2023-10-08 |website=[[Wall Street Journal]] |language=en-US}}</ref><ref>{{Cite web |title=APA PsycNet |url=https://psycnet.apa.org/record/2013-25331-001 |access-date=2023-10-08 |website=psycnet.apa.org |language=en}}</ref>

Data dredging is an example of disregarding the [[multiple comparisons problem]]. One form is comparing subgroups without alerting the reader to the total number of subgroup comparisons examined.<ref name="Deming">{{Cite journal |author1=Young, S. S. |author2=Karr, A. |title=Deming, data and observational studies |journal=Significance |volume=8 |issue=3 |year=2011 |url=http://www.niss.org/sites/default/files/Young%20Karr%20Obs%20Study%20Problem.pdf |doi=10.1111/j.1740-9713.2011.00506.x |pages=116–120 |doi-access=free}}</ref> It is a [[Questionable research practices|questionable research practice]] that can undermine scientific integrity.
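The false-positive arithmetic described above can be illustrated with a short simulation. The following is a minimal sketch (not drawn from this article's sources; it assumes Python with NumPy and SciPy): it runs many two-sample ''t''-tests on pure noise, where the null hypothesis is true by construction, and counts how many come back "significant" at the 5% level.

<syntaxhighlight lang="python">
# Simulate data dredging on pure noise: every null hypothesis is true,
# yet roughly 5% of tests appear "significant" at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 10_000   # number of hypotheses "dredged"
alpha = 0.05       # conventional significance level

false_positives = 0
for _ in range(n_tests):
    # Two groups drawn from the SAME distribution: no real effect exists.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests significant "
      f"({false_positives / n_tests:.1%})")
</syntaxhighlight>

Across runs, close to 5% of the tests are falsely significant. Reporting only those tests, while omitting the thousands of non-significant ones, is data dredging in miniature.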