=={{anchor|Controversy}}Modern origins and early controversy==

Modern significance testing is largely the product of [[Karl Pearson]] ([[p-value|''p''-value]], [[Pearson's chi-squared test]]), [[William Sealy Gosset]] ([[Student's t-distribution]]), and [[Ronald Fisher]] ("[[null hypothesis]]", [[analysis of variance]], "[[statistical significance|significance test]]"), while hypothesis testing was developed by [[Jerzy Neyman]] and [[Egon Pearson]] (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but he soon grew disenchanted with the subjectivity involved (namely the use of the [[principle of indifference]] when determining prior probabilities) and sought to provide a more "objective" approach to inductive inference.<ref name="ftp.isds.duke">Raymond Hubbard, [[M. J. Bayarri]], ''[http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf P Values are not Error Probabilities] {{webarchive|url=https://web.archive.org/web/20130904000350/http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf|date=September 4, 2013}}''. A working paper that explains the difference between Fisher's evidential ''p''-value and the Neyman–Pearson Type I error rate <math>\alpha</math>.</ref>

Fisher emphasized rigorous experimental design and methods to extract a result from few samples assuming [[Gaussian distribution]]s. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher and Neyman–Pearson formulations, methods, and terminology developed in the early 20th century.

Fisher popularized the "significance test". He required a null hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a [[Type II error]] (false negative). The ''p''-value was devised as an informal but objective index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's [[Fiducial inference|faith]] in the null hypothesis.<ref name="Fisher 1955 69–78">{{cite journal|last=Fisher|first=R|year=1955|title=Statistical Methods and Scientific Induction|url=http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf|journal=Journal of the Royal Statistical Society, Series B|volume=17|issue=1|pages=69–78|doi=10.1111/j.2517-6161.1955.tb00180.x}}</ref>
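To make Fisher's procedure concrete, the following is a minimal sketch, not drawn from the cited sources: the coin-flipping data, the function name, and the two-sided tail rule are all illustrative assumptions. An exact ''p''-value is computed under the null hypothesis and reported as a graded index of evidence; there is no alternative hypothesis and no accept/reject verdict.

<syntaxhighlight lang="python">
from math import comb

def fisher_style_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value under a binomial null: the probability,
    assuming the null, of an outcome at least as extreme as the one seen."""
    observed = abs(heads - flips * p_null)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - flips * p_null) >= observed:
            total += comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
    return total

# Fisher would report the exact p-value itself, not a binary decision;
# the researcher weighs it against other knowledge of the experiment.
p = fisher_style_p_value(heads=16, flips=20)
print(f"p = {p:.4f}")
</syntaxhighlight>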
Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's ''p''-value, also meant to determine researcher behaviour, but without requiring any [[inductive inference]] by the researcher.<ref name="Neyman 289–337">{{cite journal|last1=Neyman|first1=J|last2=Pearson|first2=E. S.|date=January 1, 1933|title=On the Problem of the most Efficient Tests of Statistical Hypotheses|journal=[[Philosophical Transactions of the Royal Society A]]|volume=231|issue=694–706|pages=289–337|bibcode=1933RSPTA.231..289N|doi=10.1098/rsta.1933.0009|doi-access=free}}</ref><ref>{{cite journal|last=Goodman|first=S N|date=June 15, 1999|title=Toward evidence-based medical statistics. 1: The P Value Fallacy|journal=Ann Intern Med|volume=130|issue=12|pages=995–1004|doi=10.7326/0003-4819-130-12-199906150-00008|pmid=10383371|s2cid=7534212}}</ref> Neyman and Pearson considered a different problem from Fisher's (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.
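As an illustration (a hedged sketch, not taken from the 1933 paper; the Gaussian means, known standard deviation, and sample size are assumed for the example, and echo the μ1 = 8 / μ2 = 10 case in the comparison table below), the Neyman–Pearson setup chooses between two fully specified hypotheses by comparing their likelihoods, and both error probabilities can be fixed before any data are collected:

<syntaxhighlight lang="python">
from math import exp, pi, sqrt, erf

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Two simple hypotheses about a mean, with known sigma and fixed n.
mu1, mu2, sigma, n = 8.0, 10.0, 2.0, 16
sem = sigma / sqrt(n)  # standard error of the sample mean

def decide(sample_mean: float) -> str:
    """Select the hypothesis under which the sample is more probable."""
    l1 = normal_pdf(sample_mean, mu1, sem)
    l2 = normal_pdf(sample_mean, mu2, sem)
    return "H1 (mu = 8)" if l1 > l2 else "H2 (mu = 10)"

# The likelihoods are equal at the midpoint, so the cutoff is 9; both
# error probabilities are then determined before the experiment runs.
cutoff = (mu1 + mu2) / 2
alpha = 1 - normal_cdf(cutoff, mu1, sem)  # P(choose H2 | H1 true)
beta = normal_cdf(cutoff, mu2, sem)       # P(choose H1 | H2 true)
print(decide(9.4), f"alpha = {alpha:.4f}, beta = {beta:.4f}")
</syntaxhighlight>

Note that the procedure always returns a decision, and that both α and β are known in advance, exactly as described above.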
Fisher and Neyman/Pearson clashed bitterly. Neyman and Pearson considered their formulation to be an improved generalization of significance testing (the defining paper<ref name="Neyman 289–337" /> was [[Neyman–Pearson lemma|abstract]]; mathematicians have generalized and refined the theory for decades<ref name="Lehmann93" />). Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data are collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion.<ref>{{cite journal|last=Fisher|first=R|year=1958|title=The Nature of Probability|url=http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf|journal=Centennial Review|volume=2|pages=261–274|quote=We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.}}</ref>

The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.<ref name="Lenhard">{{cite journal|last=Lenhard|first=Johannes|year=2006|title=Models and Statistical Inference: The Controversy between Fisher and Neyman–Pearson|journal=Br. J. Philos. Sci.|volume=57|pages=69–91|doi=10.1093/bjps/axi152|s2cid=14136146}}</ref> Events intervened: Neyman accepted a position at the [[University of California, Berkeley]] in 1938, breaking his partnership with Pearson and separating the disputants (who had occupied the same building). [[World War II]] provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy.<ref>{{cite journal|last1=Neyman|first1=Jerzy|year=1967|title=R. A. Fisher (1890–1962): An Appreciation|journal=Science|volume=156|issue=3781|pages=1456–1460|bibcode=1967Sci...156.1456N|doi=10.1126/science.156.3781.1456|pmid=17741062|s2cid=44708120}}</ref> Some of Neyman's later publications reported ''p''-values and significance levels.<ref>{{cite journal|last1=Losavich|first1=J. L.|last2=Neyman|first2=J.|last3=Scott|first3=E. L.|last4=Wells|first4=M. A.|year=1971|title=Hypothetical explanations of the negative apparent effects of cloud seeding in the Whitetop Experiment|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=68|issue=11|pages=2643–2646|bibcode=1971PNAS...68.2643L|doi=10.1073/pnas.68.11.2643|pmc=389491|pmid=16591951|doi-access=free}}</ref>

==={{anchor|NHST}}Null hypothesis significance testing (NHST)===

The modern version of hypothesis testing is generally called '''null hypothesis significance testing''' ('''NHST''')<ref name=nickerson /> and is a hybrid of the Fisher approach with the Neyman–Pearson approach. In 2000, [[Raymond S. Nickerson]] wrote an article stating that NHST was (at the time) "arguably the most widely used method of analysis of data collected in psychological experiments and has been so for about 70 years" and that it was at the same time "very controversial".<ref name=nickerson /> This fusion resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s<ref name="Halpin 625–653">{{cite journal|last1=Halpin|first1=P F|last2=Stam|first2=HJ|date=Winter 2006|title=Inductive Inference or Inductive Behavior: Fisher and Neyman-Pearson Approaches to Statistical Testing in Psychological Research (1940–1960)|journal=The American Journal of Psychology|volume=119|issue=4|pages=625–653|doi=10.2307/20445367|jstor=20445367|pmid=17286092}}</ref> (but [[Detection theory|signal detection]], for example, still uses the Neyman–Pearson formulation). Great conceptual differences, and many caveats in addition to those mentioned above, were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics, and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than with theirs.<ref name="Gigerenzer">{{cite book|last=Gigerenzer|first=Gerd|title=The Empire of Chance: How Probability Changed Science and Everyday Life|author2=Zeno Swijtink|author3=Theodore Porter|author4=Lorraine Daston|author5=John Beatty|author6=Lorenz Kruger|publisher=Cambridge University Press|year=1989|isbn=978-0-521-39838-1|pages=70–122|chapter=Part 3: The Inference Experts}}</ref> Sometime around 1940,<ref name="Halpin 625–653" /> authors of statistical textbooks began combining the two approaches by using the ''p''-value in place of the [[test statistic]] (or data) to test against the Neyman–Pearson "significance level".
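Procedurally, the hybrid is easy to state. The following minimal sketch (an illustration only, not from the cited sources; the data values and the 0.05 threshold are assumed) shows the textbook NHST recipe that resulted: a Fisherian ''p''-value is computed, but it is then compared against a fixed Neyman–Pearson style significance level to produce a binary decision.

<syntaxhighlight lang="python">
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def nhst(sample_mean: float, mu_null: float, sigma: float, n: int,
         alpha: float = 0.05) -> str:
    """Hybrid recipe: compute a Fisherian two-sided p-value, then apply a
    Neyman–Pearson style cutoff alpha to yield a reject/fail-to-reject call."""
    z = (sample_mean - mu_null) / (sigma / sqrt(n))
    p = 2 * (1 - normal_cdf(abs(z)))  # Fisher: graded evidential index
    decision = "reject H0" if p < alpha else "fail to reject H0"  # N-P: rule
    return f"p = {p:.4f}: {decision} at alpha = {alpha}"

print(nhst(sample_mean=103.2, mu_null=100, sigma=15, n=100))
</syntaxhighlight>

The table below contrasts the two pure positions that this recipe conflates.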
{| class="wikitable"
|+ A comparison between Fisherian significance testing and Neyman–Pearson decision theory
|-
! #
! Fisher's null hypothesis testing
! Neyman–Pearson decision theory
|-
| 1
| Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
| Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
|-
| 2
| Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not refer to "accepting" or "rejecting" hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data are available.
| If the data fall into the rejection region of H1, accept H2; otherwise accept H1. Accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
|-
| 3
| Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
| The usefulness of the procedure is limited, among others, to situations where you have a disjunction of hypotheses (e.g., either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing α and β.
|}