==Criticism==
{{see also|p-value#Misuse}}
Criticism of statistical hypothesis testing fills volumes.<ref name=morrison>{{cite book|orig-year=1970|year=2006|title=The Significance Test Controversy|editor1=Morrison, Denton|editor2=Henkel, Ramon|publisher=Aldine Transaction|isbn=978-0-202-30879-1}}</ref><ref>{{cite book|last=Oakes|first=Michael|title=Statistical Inference: A Commentary for the Social and Behavioural Sciences|publisher=Wiley|location=Chichester New York|year=1986|isbn=978-0471104438}}</ref><ref name=chow>{{cite book|first=Siu L.|last=Chow|year=1997|title=Statistical Significance: Rationale, Validity and Utility|publisher=SAGE Publications|isbn=978-0-7619-5205-3}}</ref><ref name=harlow>{{cite book|year=1997|title=What If There Were No Significance Tests?|editor1=Harlow, Lisa Lavoie|editor2=Stanley A. Mulaik|editor3=James H. Steiger|publisher=Lawrence Erlbaum Associates|isbn=978-0-8058-2634-0}}</ref><ref name=kline>{{cite book|last=Kline|first=Rex|title=Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research|publisher=American Psychological Association|location=Washington, D.C.|year=2004|isbn=9781591471189}}</ref><ref name=mccloskey>{{cite book|last=McCloskey|first=Deirdre N.|author2=Stephen T. Ziliak|year=2008|title=The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives|publisher=University of Michigan Press|isbn=978-0-472-05007-9}}</ref> Much of the criticism can be summarized by the following issues:
* The interpretation of a ''p''-value depends on the [[stopping rule]] and on the definition of multiple comparisons. The former often changes during the course of a study, and the latter is unavoidably ambiguous (i.e., "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").<ref>{{cite journal|last=Cornfield|first=Jerome|title=Recent Methodological Contributions to Clinical Trials|journal=American Journal of Epidemiology|volume=104|issue=4|pages=408–421|year=1976|url=http://www.epidemiology.ch/history/PDF%20bg/Cornfield%20J%201976%20recent%20methodological%20contributions.pdf|doi=10.1093/oxfordjournals.aje.a112313|pmid=788503}}</ref>
* Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson, which are conceptually distinct.<ref name="Tukey60">{{cite journal|last=Tukey|first=John W.|title=Conclusions vs decisions|journal=Technometrics|volume=2|issue=4|pages=423–433|year=1960|doi=10.1080/00401706.1960.10489909}} "Until we go through the accounts of testing hypotheses, separating [Neyman–Pearson] decision elements from [Fisher] conclusion elements, the intimate mixture of disparate elements will be a continual source of confusion." ... "There is a place for both "doing one's best" and "saying only what is certain," but it is important to know, in each instance, both which one is being done, and which one ought to be done."</ref>
* Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.<ref>{{cite journal|last=Yates|first=Frank|title=The Influence of Statistical Methods for Research Workers on the Development of the Science of Statistics|journal=Journal of the American Statistical Association|volume=46|issue=253|pages=19–34|year=1951|doi=10.1080/01621459.1951.10500764}} "The emphasis given to formal tests of significance throughout [R.A. Fisher's] Statistical Methods ... has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects they are investigating." ... "The emphasis on tests of significance and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective."</ref>
* Rigidly requiring statistical significance as a criterion for publication, resulting in [[publication bias]].<ref>{{cite journal|last1=Begg|first1=Colin B.|last2=Berlin|first2=Jesse A.|title=Publication bias: a problem in interpreting medical data|journal=Journal of the Royal Statistical Society, Series A|volume=151|issue=3|pages=419–463|year=1988|doi=10.2307/2982993|jstor=2982993|s2cid=121054702}}</ref>

Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.
* When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g., increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100% (see the simulation sketch after this list).<ref>{{cite journal|last=Meehl|first=Paul E.|title=Theory-Testing in Psychology and Physics: A Methodological Paradox|journal=Philosophy of Science|volume=34|issue=2|pages=103–115|year=1967|url=http://mres.gmu.edu/pmwiki/uploads/Main/Meehl1967.pdf|doi=10.1086/288135|s2cid=96422880|url-status=dead|archive-url=https://web.archive.org/web/20131203010657/http://mres.gmu.edu/pmwiki/uploads/Main/Meehl1967.pdf|archive-date=December 3, 2013|df=mdy-all}} Thirty years later, Meehl acknowledged statistical significance theory to be mathematically sound while continuing to question the default choice of null hypothesis, blaming instead the "social scientists' poor understanding of the logical relation between theory and fact" in "The Problem Is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions" (Chapter 14 in Harlow (1997)).</ref> However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.), because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd.
* Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts.<ref name="bakan66">{{cite journal|last=Bakan|first=David|year=1966|title=The test of significance in psychological research|journal=Psychological Bulletin|volume=66|issue=6|pages=423–437|doi=10.1037/h0020412|pmid=5974619}}</ref> If the decisions are based on convention, they are termed arbitrary or mindless,<ref name="Gigerenzer 587–606">{{cite journal|last=Gigerenzer|first=G|title=Mindless statistics|journal=The Journal of Socio-Economics|date=November 2004|volume=33|issue=5|pages=587–606|doi=10.1016/j.socec.2004.09.033}}</ref> while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology, practically all null hypotheses are claimed to be false for sufficiently large samples, so "...it is usually nonsensical to perform an experiment with the ''sole'' aim of rejecting the null hypothesis."<ref>{{cite journal|last=Nunnally|first=Jum|title=The place of statistics in psychology|journal=Educational and Psychological Measurement|volume=20|number=4|pages=641–650|year=1960|doi=10.1177/001316446002000401|s2cid=144813784}}</ref> "Statistically significant findings are often misleading" in psychology.<ref>{{cite journal|last=Lykken|first=David T.|title=What's wrong with psychology, anyway?|journal=Thinking Clearly About Psychology|volume=1|pages=3–39|year=1991}}</ref> Statistical significance does not imply practical significance, and [[correlation does not imply causation]]. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.
* "[I]t does not tell us what we want to know".<ref name=cohen94>{{cite journal|author=Jacob Cohen|title=The Earth Is Round (p < .05)|journal=American Psychologist|volume=49|issue=12|pages=997–1003|date=December 1994|doi=10.1037/0003-066X.49.12.997|s2cid=380942}} This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review.</ref>

Lists of dozens of complaints are available.<ref name=kline/><ref name="nickerson">{{cite journal|author=Nickerson, Raymond S.|title=Null Hypothesis Significance Tests: A Review of an Old and Continuing Controversy|journal=Psychological Methods|volume=5|issue=2|pages=241–301|year=2000|url=https://psycnet.apa.org/doiLanding?doi=10.1037%2F1082-989X.5.2.241|doi=10.1037/1082-989X.5.2.241|pmid=10937333|s2cid=28340967|archive-url=https://is.muni.cz/el/1423/jaro2010/PSY117/um/_Nickerson_-_NHST_controversy_review.pdf|archive-date=2000-02-23}}</ref><ref name="branch">{{cite journal|author=Branch, Mark|title=Malignant side effects of null hypothesis significance testing|journal=Theory & Psychology|volume=24|issue=2|pages=256–277|year=2014|doi=10.1177/0959354314525282|s2cid=40712136}}</ref>
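The large-sample point in the list above can be made concrete with a small simulation. The sketch below is illustrative only; it is not taken from the cited sources, and it assumes a Python environment with NumPy and SciPy. With a true between-group difference far too small to matter in practice, a conventional two-sample ''t''-test nevertheless rejects the null hypothesis almost every time once the samples are large enough.

<syntaxhighlight lang="python">
# Illustrative sketch (assumes NumPy and SciPy are installed).
# Two groups whose true means differ by a trivially small amount (0.01 SD).
# As the sample size grows, a two-sample t-test rejects the null hypothesis
# with probability approaching 1, even though the effect remains negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_difference = 0.01          # tiny "real" effect, in standard-deviation units
trials = 200                    # simulated studies per sample size

for n in (100, 10_000, 1_000_000):
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_difference, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        rejections += p < 0.05
    print(f"n = {n:>9,} per group: rejected H0 in {rejections / trials:.0%} of simulated studies")
</syntaxhighlight>

The rejection rate is driven by the ratio of the fixed true difference to its standard error, which shrinks toward zero as the sample size grows; statistical significance therefore says nothing by itself about the size of the effect.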
Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): while it can provide critical information, it is ''inadequate as the sole tool for statistical analysis''. ''Successfully rejecting the null hypothesis may offer no support for the research hypothesis.'' The continuing controversy concerns the selection of the best statistical practices for the near-term future, given the existing practices; adequate research design can, however, minimize the problem. Critics would prefer to ban NHST completely, forcing a complete departure from those practices,<ref>{{cite journal|last1=Hunter|first1=John E.|title=Needed: A Ban on the Significance Test|journal=Psychological Science|date=January 1997|volume=8|issue=1|pages=3–7|doi=10.1111/j.1467-9280.1997.tb00534.x|s2cid=145422959}}</ref> while supporters suggest a less absolute change.{{citation needed|date=December 2015}}

Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The [[American Psychological Association]] has strengthened its statistical reporting requirements after review,<ref name=wilkinson>{{cite journal|author=Wilkinson, Leland|title=Statistical Methods in Psychology Journals: Guidelines and Explanations|journal=American Psychologist|volume=54|issue=8|pages=594–604|year=1999|doi=10.1037/0003-066X.54.8.594|s2cid=428023}} "Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval." (p 599). The committee used the cautionary term "forbearance" in describing its decision against a ban of hypothesis testing in psychology reporting. (p 603)</ref> [[medical journal]] publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,<ref>{{cite web|url=http://www.icmje.org/publishing_1negative.html|title=ICMJE: Obligation to Publish Negative Studies|access-date=September 3, 2012|quote=Editors should seriously consider for publication any carefully done study of an important question, relevant to their readers, whether the results for the primary or any additional outcome are statistically significant. Failure to submit or publish findings because of lack of statistical significance is an important cause of publication bias.|url-status=dead|archive-url=https://web.archive.org/web/20120716211637/http://www.icmje.org/publishing_1negative.html|archive-date=July 16, 2012|df=mdy-all}}</ref> and a journal (the ''Journal of Articles in Support of the Null Hypothesis'') has been created to publish such results exclusively.<ref name=JASNH>''Journal of Articles in Support of the Null Hypothesis'' website: [http://www.jasnh.com/ JASNH homepage]. Volume 1 number 1 was published in 2002, and all articles are on psychology-related subjects.</ref> Textbooks have added some cautions<ref>{{cite book|title=Statistical Methods for Psychology|last=Howell|first=David|year=2002|publisher=Duxbury|edition=5|isbn=978-0-534-37770-0|page=[https://archive.org/details/statisticalmetho0000howe/page/94 94]|url=https://archive.org/details/statisticalmetho0000howe/page/94}}</ref> and have increased their coverage of the tools needed to estimate the sample size required to produce significant results. Few major organizations have abandoned the use of significance tests, although some have discussed doing so.<ref name=wilkinson/> For instance, in 2023 the editors of the ''[[Journal of Physiology]]'' "strongly recommend the use of estimation methods for those publishing in The Journal": that is, reporting the magnitude of the [[effect size]], so that readers can judge whether a finding has practical, physiological, or clinical relevance, together with [[confidence intervals]] to convey the precision of that estimate. The editors write, "Ultimately, it is the physiological importance of the data that those publishing in The Journal of Physiology should be most concerned with, rather than the statistical significance."<ref name="WilliamsToth2023">{{cite journal|last1=Williams|first1=S.|last2=Carson|first2=R.|last3=Tóth|first3=K.|title=Moving beyond P values in The Journal of Physiology: A primer on the value of effect sizes and confidence intervals|journal=J Physiol|date=October 10, 2023|volume=601|issue=23|pages=5131–5133|doi=10.1113/JP285575|pmid=37815959|s2cid=263827430|doi-access=free}}</ref>
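The estimation approach described above can be illustrated with a short, hypothetical sketch (again assuming Python with NumPy and SciPy; the code is not taken from the cited editorial). Rather than reporting only whether ''p'' < 0.05, the analyst reports the estimated magnitude of the effect together with a confidence interval that conveys its precision.

<syntaxhighlight lang="python">
# Hypothetical sketch: report an effect estimate with a 95% confidence interval
# instead of (or alongside) a bare p-value. Assumes NumPy and SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(10.5, 2.0, 40)   # simulated measurements, n = 40 per group
control = rng.normal(10.0, 2.0, 40)

diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd              # standardized effect size

# Pooled-variance 95% confidence interval for the mean difference.
se = pooled_sd * np.sqrt(1 / len(treatment) + 1 / len(control))
df = len(treatment) + len(control) - 2
ci_low, ci_high = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se

print(f"mean difference = {diff:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), Cohen's d = {cohens_d:.2f}")
</syntaxhighlight>

Reported this way, the interval and the standardized effect size let a reader judge practical relevance directly, which a bare accept/reject decision does not.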