Statistical hypothesis test
==Performing a frequentist hypothesis test in practice==
The typical steps involved in performing a frequentist hypothesis test in practice are:
# Define a hypothesis (a claim which is testable using data).
# Select a relevant statistical test with associated [[test statistic]] <var>T</var>.
# Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a [[Student's t distribution]] with known degrees of freedom, or a [[normal distribution]] with known mean and variance.
# Select a significance level (''α''), the maximum acceptable [[false positive rate]]. Common values are 5% and 1%.
# Compute from the observations the observed value <var>t</var><sub>obs</sub> of the test statistic <var>T</var>.
# Decide either to reject the null hypothesis in favor of the alternative or not to reject it. The [[Neyman–Pearson lemma|Neyman–Pearson]] decision rule is to reject the null hypothesis <var>H</var><sub>0</sub> if the observed value <var>t</var><sub>obs</sub> is in the critical region, and not to reject it otherwise.<ref>{{Cite journal |date=2005 |title=Testing Statistical Hypotheses |url=https://link.springer.com/book/10.1007/0-387-27605-X |journal=Springer Texts in Statistics |language=en |doi=10.1007/0-387-27605-x |isbn=978-0-387-98864-1 |issn=1431-875X}}</ref>

===Practical example===
The difference in the two processes applied to the radioactive suitcase example (below):
* "The Geiger-counter reading is 10. The limit is 9. Check the suitcase."
* "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."
The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.
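The numbered steps above can be sketched as a short Python function. This is a minimal illustration for a two-sided one-sample ''z''-test (population standard deviation assumed known); the sample values, hypothesized mean, and standard deviation below are made up for the example:

```python
import math
from statistics import NormalDist

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test of H0: mean == mu0, with sigma known."""
    n = len(sample)
    xbar = sum(sample) / n
    # Step 5: compute the observed value of the test statistic T.
    t_obs = (xbar - mu0) / (sigma / math.sqrt(n))
    # Step 3: under H0, the statistic follows a standard normal distribution,
    # so the two-sided p-value is the probability of a value at least as extreme.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_obs)))
    # Step 6: Neyman-Pearson rule -- reject iff t_obs lies in the critical
    # region, equivalently iff the p-value falls below the chosen level alpha.
    return t_obs, p_value, p_value < alpha

# Hypothetical data: is the process mean 5.0, given sigma = 0.1?
t_obs, p, reject = z_test([5.1, 4.9, 5.0, 5.2], mu0=5.0, sigma=0.1)
```

With this made-up sample the observed statistic is 1.0, well inside the acceptance region at ''α'' = 0.05, so the null hypothesis is not rejected.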
Not rejecting the null hypothesis does not mean the null hypothesis is "accepted" per se (though Neyman and Pearson used that word in their original writings; see the [[#Interpretation|Interpretation]] section).

The processes described here are perfectly adequate for computation, but they seriously neglect [[design of experiments]] considerations.<ref>{{cite book|author1=Hinkelmann, Klaus|author2=Kempthorne, Oscar |author-link2=Oscar Kempthorne |year=2008|title=Design and Analysis of Experiments|volume=I and II|edition=Second|publisher=Wiley|isbn=978-0-470-38551-7}}</ref><ref>{{cite book|last=Montgomery|first=Douglas|title=Design and analysis of experiments|publisher=Wiley|location=Hoboken, N.J.|year=2009|isbn=978-0-470-12866-4}}</ref> It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.

The phrase "test of significance" was coined by statistician [[Ronald Fisher]].<ref name="Fisher1925">R. A. Fisher (1925). ''Statistical Methods for Research Workers'', Edinburgh: Oliver and Boyd, p. 43.</ref>

===Interpretation===
When the null hypothesis is true and the statistical assumptions are met, the probability that the ''p''-value will be less than or equal to the significance level <math>\alpha</math> is at most <math>\alpha</math>. This ensures that the hypothesis test maintains its specified false positive rate (provided that the statistical assumptions are met).<ref name="LR" /> The ''p''-value is the probability that a test statistic at least as extreme as the one obtained would occur under the null hypothesis. For example, at a significance level of 0.05, a test of a fair coin would be expected to (incorrectly) reject the null hypothesis (that the coin is fair) in about 1 out of 20 tests on average.
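The guarantee that the ''p''-value falls at or below <math>\alpha</math> with probability at most <math>\alpha</math> under the null hypothesis can be checked exactly for a discrete case. The sketch below is an illustration for the fair-coin test, using one common (conservative) convention for the two-sided exact binomial ''p''-value, namely doubling the smaller tail:

```python
from math import comb

def false_positive_rate(n, alpha=0.05):
    """Exact probability of rejecting H0: p = 0.5 for a fair coin and n flips,
    using a two-sided exact binomial test (double-the-smaller-tail p-value)."""
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    cdf = [sum(pmf[:k + 1]) for k in range(n + 1)]
    rate = 0.0
    for k in range(n + 1):
        # Two-sided p-value for observing k heads: twice the smaller of
        # P(X <= k) and P(X >= k), capped at 1.
        p = min(2 * min(cdf[k], 1 - cdf[k] + pmf[k]), 1.0)
        if p <= alpha:
            rate += pmf[k]  # probability mass of the rejection region
    return rate
```

Because the binomial distribution is discrete, the attained rate (about 3.5% for ''n'' = 100 flips) is typically strictly below the nominal 5%, which is why the guarantee is stated as "at most <math>\alpha</math>".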
The ''p''-value does not provide the probability that either the null hypothesis or its opposite is correct (a common source of confusion).<ref>{{Cite journal|last=Nuzzo|first=Regina|author-link= Regina Nuzzo |date=2014|title=Scientific method: Statistical errors|journal=Nature|volume=506|issue=7487|pages=150–152|bibcode=2014Natur.506..150N|doi=10.1038/506150a|pmid=24522584|doi-access=free}}</ref> If the ''p''-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then the null hypothesis is rejected at the chosen level of significance. If the ''p''-value is ''not'' less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the null hypothesis is not rejected.

In the "lady tasting tea" example (below), Fisher required the lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to have arisen by chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur.

===Use and importance===
Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing, which can justify conclusions even when no scientific theory exists. In the lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious".

Real-world applications of hypothesis testing include:<ref name=larsen>{{cite book|author1=Richard J. Larsen |author2=Donna Fox Stroup |title=Statistics in the Real World: a book of examples|publisher=Macmillan|isbn=978-0023677205|year=1976}}</ref>
* Testing whether more men than women suffer from nightmares
* Establishing the authorship of documents
* Evaluating the effect of the full moon on behavior
* Determining the range at which a bat can detect an insect by echo
* Deciding whether hospital carpeting results in more infections
* Selecting the best means to stop smoking
* Checking whether bumper stickers reflect car owner behavior
* Testing the claims of handwriting analysts

Statistical hypothesis testing plays an important role in the whole of statistics and in [[statistical inference]]. For example, Lehmann (1992), in a review of the fundamental paper by Neyman and Pearson (1933), says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".

Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the ''Journal of Applied Psychology'' during the early 1990s).<ref name=hubbard>{{cite journal|author1=Hubbard, R. |author2=Parsa, A. R. |author3=Luthy, M. R. |title=The Spread of Statistical Significance Testing in Psychology: The Case of the Journal of Applied Psychology |journal=Theory and Psychology |volume=7 |pages=545–554 |year=1997|doi=10.1177/0959354397074006 |issue=4|s2cid=145576828 }}</ref> Other fields have favored the estimation of parameters (e.g. [[effect size]]). Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of the [[scientific method]].
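The 1.4% figure in the lady tasting tea example above comes from a simple counting argument: with eight cups, four prepared each way, a guesser who knows that exactly four cups had milk poured first must pick the correct four out of <math>\tbinom{8}{4} = 70</math> equally likely choices. A minimal sketch:

```python
from math import comb

# Lady tasting tea: 8 cups, 4 with milk poured first and 4 with tea first.
# Under the null hypothesis (pure guessing), each choice of 4 cups out of 8
# is equally likely, so a perfect classification has probability 1/C(8,4).
p_perfect = 1 / comb(8, 4)  # 1/70, about 1.4%
```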
When theory is only capable of predicting the sign of a relationship, a directional (one-sided) hypothesis test can be configured so that only a statistically significant result supports the theory. This form of theory appraisal is the most heavily criticized application of hypothesis testing.

===Cautions===
"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed."<ref name="moore">{{cite book|last=Moore|first=David|title=Introduction to the Practice of Statistics|publisher=W.H. Freeman and Co|location=New York|year=2003|page=426|isbn=9780716796572}}</ref> This caution applies to hypothesis tests and to alternatives to them.

A successful hypothesis test is associated with a probability and a type I error rate. The conclusion ''might'' be wrong.

The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed, including:
* The [[clever Hans effect]]. A horse appeared to be capable of doing simple arithmetic.
* The [[Hawthorne effect]]. Industrial workers were more productive in better illumination, and most productive in worse.
* The [[placebo effect]]. Pills with no medically active ingredients were remarkably effective.

A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In [[forecasting]], for example, there is no agreement on a measure of forecast accuracy; in the absence of a consensus measurement, no decision based on measurements will be without controversy.

Publication bias: statistically nonsignificant results may be less likely to be published, which can bias the literature.
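The leniency of the directional (one-sided) test mentioned above can be made concrete: at the same nominal level, a one-sided ''z''-test has a smaller critical value than a two-sided one, so a result of a given size is easier to declare significant when its sign was predicted in advance. A minimal sketch using the standard normal distribution:

```python
from statistics import NormalDist

# Critical values of a z-test at nominal level alpha = 0.05.
alpha = 0.05
z = NormalDist()
one_sided = z.inv_cdf(1 - alpha)      # about 1.645: reject if z > 1.645
two_sided = z.inv_cdf(1 - alpha / 2)  # about 1.960: reject if |z| > 1.960
```

Any observed statistic between the two critical values is "significant" under the directional test but not under the two-sided one, which is part of why this use is criticized.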
Multiple testing: when multiple true-null hypothesis tests are conducted at once without adjustment, the overall probability of a Type I error is higher than the nominal alpha level.<ref>{{cite journal |last1=Ranganathan |first1=Priya |last2=Pramesh |first2=C. S |last3=Buyse |first3=Marc |title=Common pitfalls in statistical analysis: The perils of multiple testing |journal=Perspect Clin Res |date=April–June 2016 |volume=7 |issue=2 |pages=106–107 |doi=10.4103/2229-3485.179436 |pmid=27141478|pmc=4840791 |doi-access=free }}</ref>

Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences, most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).
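The multiple-testing inflation described above follows from a short calculation: if ''m'' independent tests of true null hypotheses are each run at level ''α'', the chance of at least one false positive is 1 − (1 − ''α'')<sup>''m''</sup>. A minimal sketch, including the common Bonferroni correction (testing each hypothesis at level ''α''/''m''):

```python
def fwer(m, alpha=0.05):
    """Family-wise error rate: probability of at least one false positive
    across m independent tests of true null hypotheses at level alpha."""
    return 1 - (1 - alpha) ** m

uncorrected = fwer(20)             # about 0.64: a false positive is likely
bonferroni = fwer(20, 0.05 / 20)   # Bonferroni brings the rate back below 0.05
```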