== Neyman–Pearson hypothesis testing ==

An example of Neyman–Pearson hypothesis testing (or null hypothesis statistical significance testing) can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The [[Neyman–Pearson lemma]] of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a [[likelihood-ratio test|likelihood ratio]]). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. Notice also that usually there are problems for [[Philosophic burden of proof#Proving a negative|proving a negative]]. Null hypotheses should be at least [[Falsifiability|falsifiable]].

Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.<ref name="Ash">{{cite book | last = Ash | first = Robert | title = Basic probability theory | publisher = Wiley | location = New York | year = 1970 | isbn = 978-0471034506 }}Section 8.2</ref> The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses.

The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice.
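The "select the hypothesis with the highest probability" rule above can be sketched in a few lines of Python. The Poisson count rates below are illustrative assumptions (the article gives no numbers), chosen only so that low, intermediate, and high counts favour zero, one, and two sources respectively:

```python
import math

# Hypothetical mean Geiger counts per measurement interval under each
# hypothesis. These rates are illustrative, not from the article.
RATES = {"no source": 2.0, "one source": 12.0, "two sources": 22.0}

def poisson_pmf(k, lam):
    """Probability of observing k counts when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def most_likely_hypothesis(observed_counts):
    """Pick the hypothesis giving the observed count the highest probability."""
    return max(RATES, key=lambda h: poisson_pmf(observed_counts, RATES[h]))

print(most_likely_hypothesis(1))   # few counts -> "no source"
print(most_likely_hypothesis(11))  # intermediate counts -> "one source"
print(most_likely_hypothesis(25))  # many counts -> "two sources"
```

Comparing each pair of hypotheses this way is equivalent to thresholding their likelihood ratio, which is the criterion the Neyman–Pearson lemma identifies as optimal.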
In the view of [[John Tukey|Tukey]]<ref name="Tukey60" /> the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments led to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0, 1, 2, 3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933<ref name="Neyman 289–337" /> also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) ''t''-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception. Fisher's significance testing has proven a popular flexible statistical tool in application with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics,<ref>{{cite journal | last = Stigler | first = Stephen M. | title = The History of Statistics in 1933 | journal = Statistical Science | volume = 11 | issue = 3 | pages = 244–252 | date = August 1996 | jstor=2246117 | doi=10.1214/ss/1032280216| doi-access = free}}</ref> creating a new paradigm for the field. It also stimulated new applications in [[statistical process control]], [[detection theory]], [[decision theory]] and [[game theory]]. Both formulations have been successful, but the successes have been of a different character. The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations.
Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible<ref name="ftp.isds.duke" /> or complementary.<ref name="Lehmann93" /> The dispute has become more complex since Bayesian inference has achieved respectability. The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion. Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that hypothesis testing could be useful for scientists.<ref name="Fisher 1955 69–78"/> Hypothesis testing provides a means of finding test statistics used in significance testing.<ref name="Lehmann93" /> The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in [[sample size determination]]. The two methods remain philosophically distinct.<ref name=Lenhard/> They usually (but ''not always'') produce the same mathematical answer. The preferred answer is context dependent.<ref name="Lehmann93">{{cite journal|last=Lehmann|first=E. L.|title=The Fisher, Neyman–Pearson Theories of Testing Hypotheses: One Theory or Two?|journal=Journal of the American Statistical Association|volume=88|issue=424|pages=1242–1249|date=December 1993|doi=10.1080/01621459.1993.10476404}}</ref> While the existing merger of Fisher and Neyman–Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.<ref>{{cite journal|last=Berger|first=James O.|title=Could Fisher, Jeffreys and Neyman Have Agreed on Testing?|journal=Statistical Science|volume=18|issue=1|pages=1–32|year=2003|doi=10.1214/ss/1056397485|doi-access=free}}</ref>