Testing hypotheses suggested by the data
{{Short description|Problem of circular reasoning in statistics}}
{{Distinguish-redirect|Post hoc theorizing|Post hoc analysis}}
{{More citations needed|date=January 2008}}
In [[statistics]], '''hypotheses suggested by a given dataset''', when tested with the same dataset that suggested them, are likely to be accepted even when they are not true. This is because [[circular reasoning]] (double dipping) would be involved: something seems true in the limited data set; therefore we hypothesize that it is true in general; therefore we wrongly test it on the same, limited data set, which seems to confirm that it is true. Generating hypotheses based on data already observed, without testing them on new data, is referred to as '''''post hoc'' theorizing''' (from [[Latin language|Latin]] ''[[post hoc analysis|post hoc]]'', "after this"). The correct procedure is to test any hypothesis on a data set that was not used to generate the hypothesis.

==The general problem==
Testing a hypothesis suggested by the data can very easily result in false positives ([[type I error]]s). If one looks long enough and in enough different places, eventually data can be found to support any hypothesis. Yet these positive data do not by themselves constitute [[scientific evidence|evidence]] that the hypothesis is correct. The negative test data that were thrown out are just as important, because they give an idea of how common the positive results are compared to chance.

Running an experiment, seeing a pattern in the data, proposing a hypothesis from that pattern, and then using the ''same'' experimental data as evidence for the new hypothesis is extremely suspect, because data from all other experiments, completed or potential, have essentially been "thrown out" by choosing to look only at the experiments that suggested the new hypothesis in the first place.

A large set of tests as described above greatly inflates the [[probability]] of a [[type I error]], as all but the data most favorable to the [[hypothesis]] are discarded. This is a risk not only in [[statistical hypothesis testing|hypothesis testing]] but in all [[statistical inference]], as it is often problematic to describe accurately the process that has been followed in searching for and discarding [[data]]. In other words, one wants to keep all data (regardless of whether they tend to support or refute the hypothesis) from "good tests", but it is sometimes difficult to determine what a "good test" is. It is a particular problem in [[statistical model]]ling, where many different models are rejected by [[trial and error]] before a result is published (see also [[overfitting]], [[publication bias]]).

The error is particularly prevalent in [[data mining]] and [[machine learning]]. It also commonly occurs in [[academic publishing]], where only reports of positive rather than negative results tend to be accepted, resulting in the effect known as [[publication bias]].
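The severity of the inflation can be illustrated with a small simulation. The sketch below is an illustrative example in Python using NumPy and SciPy; the number of candidate variables, sample sizes, and trial counts are arbitrary choices rather than values from the cited literature. It generates pure noise, lets the data "suggest" the most promising variable, and then tests that hypothesis either on the same data that suggested it or on an independent confirmation sample:

<syntaxhighlight lang="python">
# Illustrative simulation (hypothetical example): every candidate variable is
# pure noise, so every null hypothesis of zero mean is actually true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_vars, n_obs, alpha = 2000, 20, 30, 0.05
same_data_hits = confirmation_hits = 0

for _ in range(n_trials):
    # Exploration sample: n_obs observations of n_vars noise variables.
    explore = rng.normal(size=(n_obs, n_vars))
    # "Hypothesis suggested by the data": the variable with the largest
    # one-sample t statistic against a mean of zero.
    t = explore.mean(axis=0) / (explore.std(axis=0, ddof=1) / np.sqrt(n_obs))
    best = int(np.argmax(np.abs(t)))
    # Fallacy: test the suggested hypothesis on the same exploration data.
    if stats.ttest_1samp(explore[:, best], 0.0).pvalue < alpha:
        same_data_hits += 1
    # Correct procedure: test it on an independent confirmation sample.
    confirm = rng.normal(size=n_obs)
    if stats.ttest_1samp(confirm, 0.0).pvalue < alpha:
        confirmation_hits += 1

print(f"false-positive rate, same data:           {same_data_hits / n_trials:.2f}")
print(f"false-positive rate, confirmation sample: {confirmation_hits / n_trials:.2f}")
</syntaxhighlight>

Because the selected p-value is the smallest of twenty independent tests of true null hypotheses, the same-data rate comes out near 1 − 0.95<sup>20</sup> ≈ 0.64 rather than 0.05, whereas testing the suggested hypothesis on an independent confirmation sample, one of the remedies listed in the next section, keeps the rate at the nominal level.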
==Correct procedures==
All strategies for sound testing of hypotheses suggested by the data involve including a wider range of tests in an attempt to validate or refute the new hypothesis. These include:
*Collecting [[confirmation sample]]s
*[[Cross-validation (statistics)|Cross-validation]]
*Methods of compensation for [[multiple comparisons]]
*Simulation studies including adequate representation of the multiple testing actually involved

[[Scheffé test|Henry Scheffé's simultaneous test]] of all contrasts in [[multiple comparisons|multiple comparison]] problems is the most{{Citation needed|date=February 2011}} well-known remedy in the case of [[analysis of variance]].<ref>[[Henry Scheffé]], "A Method for Judging All Contrasts in the Analysis of Variance", ''[[Biometrika]]'', 40, pages 87–104 (1953). {{doi|10.1093/biomet/40.1-2.87}}</ref> It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above.

==See also==
*[[Bonferroni correction]]
*[[Data analysis]]
*[[Data dredging]], ''p''-hacking
*[[Exploratory data analysis]]
*[[HARKing]]
*[[Post hoc analysis]]
*[[Predictive analytics]]
*[[Texas sharpshooter fallacy]]
*[[Type I and type II errors]]
*[[Uncomfortable science]]

== Notes and references ==
{{reflist}}

[[Category:Statistical hypothesis testing]]
[[Category:Misuse of statistics]]
[[Category:Multiple comparisons]]