==Types==

=== Drawing conclusions from data ===
The conventional [[statistical hypothesis testing]] procedure using [[frequentist probability]] is to formulate a research hypothesis, such as "people in higher social classes live longer", and then to collect relevant data. A statistical [[significance test]] is then carried out to see how likely such results would be by chance alone (also called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every [[data set]] contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same [[statistical population]], it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, [[flipping a coin]] five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form a hypothesis in advance of what the tails probability is, and then to toss the coin a number of times to see whether the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. The statistical significance under the incorrect procedure is completely spurious: significance tests do not protect against data dredging.
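The following is a minimal sketch in Python (illustrative only, not from the cited literature; the coin is fair and the number of simulated repetitions is arbitrary) of why such a confirmation is vacuous: a hypothesis read off from five tosses always "matches" those same five tosses, but usually fails on a fresh set of tosses.

<syntaxhighlight lang="python">
# Minimal sketch: confirming a hypothesis on the data that suggested it is meaningless.
import random

random.seed(1)

def tails_in(n_tosses, p_tails=0.5):
    """Number of tails observed in n_tosses flips of a fair coin."""
    return sum(random.random() < p_tails for _ in range(n_tosses))

trials = 10_000
same_data_matches = 0   # hypothesis "checked" against the tosses that produced it
new_data_matches = 0    # hypothesis checked against five fresh tosses

for _ in range(trials):
    observed_tails = tails_in(5)
    hypothesised_rate = observed_tails / 5        # e.g. 3/5 after seeing 3 tails
    same_data_matches += (observed_tails / 5 == hypothesised_rate)   # always true
    new_data_matches += (tails_in(5) / 5 == hypothesised_rate)       # usually false

print(f"confirmed on the same data: {same_data_matches / trials:.2f}")  # 1.00
print(f"confirmed on fresh tosses:  {new_data_matches / trials:.2f}")   # well below 1.00
</syntaxhighlight>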
=== Optional stopping ===
[[File:P-hacking by early stopping.svg|thumb|315x315px|The figure shows the change in p-values computed from a t-test as the sample size increases, and how early stopping can allow for p-hacking. Data are drawn from two identical normal distributions, <math>N(0, 10)</math>. For each sample size <math>n</math>, ranging from 5 to <math>10^4</math>, a t-test is performed on the first <math>n</math> samples from each distribution, and the resulting p-value is plotted. The red dashed line indicates the commonly used significance level of 0.05. If the data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.]]
Optional stopping is a practice where one collects data until some stopping criterion is reached. While it is a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to be the sum of the probabilities of all events at least as rare as the one observed. With optional stopping, there are even rarer events that are difficult to account for, i.e. not triggering the optional stopping rule, collecting even more data, and only then stopping. Neglecting these events leads to a reported p-value that is too low.

In fact, if the null hypothesis is true, then ''any'' significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always been planning to collect exactly this much data) is obtained.<ref name=":9">{{Cite journal |last=Wagenmakers |first=Eric-Jan |date=October 2007 |title=A practical solution to the pervasive problems of p values |url=http://link.springer.com/10.3758/BF03194105 |journal=Psychonomic Bulletin & Review |language=en |volume=14 |issue=5 |pages=779–804 |doi=10.3758/BF03194105 |issn=1069-9384 |pmid=18087943}}</ref> For a concrete example of testing for a fair coin, see {{section link|P-value|Optional stopping|display=''p''-value}}. More succinctly, the proper calculation of the p-value requires accounting for counterfactuals, that is, what the experimenter ''could'' have done in reaction to data that ''might'' have been observed. Accounting for what might have been is hard, even for honest researchers.<ref name=":9" /> One benefit of preregistration is to account for all counterfactuals, allowing the p-value to be calculated correctly.<ref>{{Cite journal |last1=Wicherts |first1=Jelte M. |last2=Veldkamp |first2=Coosje L. S. |last3=Augusteijn |first3=Hilde E. M. |last4=Bakker |first4=Marjan |last5=van Aert |first5=Robbie C. M. |last6=van Assen |first6=Marcel A. L. M. |date=2016-11-25 |title=Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking |journal=Frontiers in Psychology |volume=7 |page=1832 |doi=10.3389/fpsyg.2016.01832 |issn=1664-1078 |pmc=5122713 |pmid=27933012 |doi-access=free}}</ref>

The problem of early stopping is not limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.<ref name="mlh">{{Cite journal |last1=Head |first1=Megan L. |last2=Holman |first2=Luke |last3=Lanfear |first3=Rob |last4=Kahn |first4=Andrew T. |last5=Jennions |first5=Michael D. |date=2015-03-13 |title=The Extent and Consequences of P-Hacking in Science |journal=PLOS Biology |language=en |volume=13 |issue=3 |pages=e1002106 |doi=10.1371/journal.pbio.1002106 |issn=1545-7885 |pmc=4359000 |pmid=25768323 |doi-access=free}}</ref>
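A minimal simulation sketch (not the code used to produce the figure above; the sample sizes, checkpoints, and trial counts are arbitrary choices) illustrates the effect: when both groups are drawn from the same <math>N(0, 10)</math> distribution, stopping as soon as a running t-test dips below 0.05 produces a spurious "significant" result far more often than 5% of the time.

<syntaxhighlight lang="python">
# Minimal sketch: optional stopping inflates the false-positive rate even though
# both groups come from the same normal distribution (the null hypothesis is true).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
max_n = 500                              # maximum sample size per group
checkpoints = range(5, max_n + 1, 5)     # look at the data every 5 samples

def stops_early(rng):
    """Return True if the running t-test ever dips below alpha."""
    a = rng.normal(0, 10, max_n)
    b = rng.normal(0, 10, max_n)
    for n in checkpoints:
        p = stats.ttest_ind(a[:n], b[:n]).pvalue
        if p < alpha:
            return True                  # "stop and report a significant difference"
    return False

trials = 1000
false_positives = sum(stops_early(rng) for _ in range(trials))
print(f"false-positive rate with optional stopping: {false_positives / trials:.2f}")
# Typically well above the nominal 0.05, although the two groups are identical.
</syntaxhighlight>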
=== Post-hoc data replacement ===
If data are removed ''after'' some analysis has already been done on them, for example on the pretext of "removing outliers", the false positive rate increases. Replacing the "outliers" with substitute data increases the false positive rate further.<ref name=":0">{{Cite journal |last=Szucs |first=Denes |date=2016-09-22 |title=A Tutorial on Hunting Statistical Significance by Chasing N |journal=Frontiers in Psychology |language=English |volume=7 |doi=10.3389/fpsyg.2016.01444 |doi-access=free |pmid=27713723 |issn=1664-1078 |pmc=5031612}}</ref>

=== Post-hoc grouping ===
If a dataset contains multiple features, one or more of them can be used to group the data, potentially creating a statistically significant result. For example, if a dataset of patients records their age and sex, a researcher can group them by age and check whether the illness recovery rate correlates with age. If it does not, the researcher might check whether it correlates with sex. If not, then perhaps it correlates with age after controlling for sex, and so on. The number of possible groupings grows exponentially with the number of features.<ref name=":0" />
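A minimal sketch (illustrative only; the feature names are hypothetical and the data are random by construction) shows how trying several groupings of the same data inflates the chance of finding at least one "significant" association well beyond the nominal 5%.

<syntaxhighlight lang="python">
# Minimal sketch: trying many post-hoc groupings makes a spurious "significant"
# result likely even when recovery is pure chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_patients = 200
features = ["age_over_50", "sex", "smoker", "urban", "blood_type_o"]  # hypothetical

def any_grouping_significant(rng):
    """Simulate one dataset and try grouping by each recorded feature in turn."""
    recovered = rng.integers(0, 2, n_patients)                   # recovery is pure chance
    data = {name: rng.integers(0, 2, n_patients) for name in features}
    for name in features:
        group = data[name]
        table = [[int(np.sum((group == g) & (recovered == r))) for r in (0, 1)]
                 for g in (0, 1)]
        _, p, _, _ = stats.chi2_contingency(table)
        if p < alpha:
            return True                                          # report this grouping
    return False

trials = 1000
hits = sum(any_grouping_significant(rng) for _ in range(trials))
print(f"chance of finding some 'significant' grouping: {hits / trials:.2f}")
# Substantially above 0.05 once several groupings are tried.
</syntaxhighlight>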
=== Hypothesis suggested by non-representative data ===
{{main article|Testing hypotheses suggested by the data}}
Suppose that a study of a [[random sample]] of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be [[reproducible]]; any attempt to check whether others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.

=== Systematic bias ===
{{main article|Bias}}
Bias is a systematic error in the analysis. For example, doctors directed [[HIV]] patients at high cardiovascular risk to a particular HIV treatment, [[abacavir]], and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were at higher risk and so more of them had heart attacks.<ref name="Deming" /> This problem can be very severe, for example, in [[observational study|observational studies]].<ref name="Deming" /><ref name="bmj02">{{Cite journal |author1=Davey Smith, G. |author1-link=George Davey Smith |author2=Ebrahim, S. |title=Data dredging, bias, or confounding |journal=BMJ |volume=325 |year=2002 |pmc=1124898 |doi=10.1136/bmj.325.7378.1437 |pmid=12493654 |issue=7378 |pages=1437–1438}}</ref> Missing factors, unmeasured [[confounders]], and loss to follow-up can also lead to bias.<ref name="Deming" /> By selecting papers with significant [[p-value|''p''-values]], negative studies are selected against, which is [[publication bias]]. This is also known as ''file drawer bias'', because less significant ''p''-value results are left in the file drawer and never published.

=== Multiple modelling ===
Another aspect of the conditioning of [[statistical test]]s by knowledge of the data can be seen while using the {{clarify span|system or machine analysis and [[linear regression]] to observe the frequency of data.|date=October 2019}} A crucial step in the process is to decide which [[covariate]]s to include in a relationship explaining one or more other variables. There are both statistical (see [[stepwise regression]]) and substantive considerations that lead authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that do not fall through the net, in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, but it may also introduce bias and alter [[mean square error]] in estimation.<ref name="Selvin">{{Cite journal |author1=Selvin, H. C. |author2=Stuart, A. |title=Data-Dredging Procedures in Survey Analysis |journal=The American Statistician |volume=20 |issue=3 |pages=20–23 |year=1966 |doi=10.1080/00031305.1966.10480401 |jstor=2681493}}</ref><ref name="BerkBrownZhao">{{Cite journal |author1=Berk, R. |author2=Brown, L. |author3=Zhao, L. |title=Statistical Inference After Model Selection |journal=J Quant Criminol |doi=10.1007/s10940-009-9077-7 |year=2009 |volume=26 |issue=2 |pages=217–236 |s2cid=10350955 |url=https://repository.upenn.edu/statistics_papers/540}}</ref>
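A minimal sketch (illustrative only, not from the cited papers; the sample sizes and number of covariates are arbitrary) shows the effect described by Selvin and Stuart: if the covariate most correlated with the outcome is retained by looking at the data, and is then tested on those same data as though it had been chosen in advance, the retained variable appears "significant" far more often than the nominal 5%, even when no covariate is related to the outcome.

<syntaxhighlight lang="python">
# Minimal sketch: selecting the "best" covariate from the data and then testing it
# on the same data yields spuriously small p-values under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_obs, n_covariates = 100, 20            # arbitrary sizes for the illustration

def retained_covariate_significant(rng):
    """Select the covariate most correlated with y, then test it on the same data."""
    X = rng.normal(size=(n_obs, n_covariates))   # covariates unrelated to the outcome
    y = rng.normal(size=n_obs)                   # outcome is pure noise
    correlations, pvalues = [], []
    for j in range(n_covariates):
        r, p = stats.pearsonr(X[:, j], y)
        correlations.append(abs(r))
        pvalues.append(p)
    best = int(np.argmax(correlations))          # the variable that "passed" the preliminary look
    return pvalues[best] < alpha                 # tested as though chosen in advance

trials = 1000
hits = sum(retained_covariate_significant(rng) for _ in range(trials))
print(f"fraction of retained covariates reported 'significant': {hits / trials:.2f}")
# Far above the nominal 0.05, even though no covariate is related to the outcome.
</syntaxhighlight>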