==Examples==
===Human sex ratio===
{{main|Human sex ratio}}
The earliest use of statistical hypothesis testing is generally credited to work on the question of whether male and female births are equally likely (the null hypothesis), which was addressed in the 1700s by [[John Arbuthnot]] (1710),<ref>{{cite journal|author=John Arbuthnot|year=1710|title=An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes|url=http://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf|journal=[[Philosophical Transactions of the Royal Society of London]]|volume=27|issue=325–336|pages=186–190|doi=10.1098/rstl.1710.0011|doi-access=free|s2cid=186209819}}</ref> and later by [[Pierre-Simon Laplace]] (1770s).<ref>{{cite book|last1=Brian|first1=Éric|url=https://archive.org/details/descenthumansexr00bria|title=The Descent of Human Sex Ratio at Birth|last2=Jaisson|first2=Marie|publisher=Springer Science & Business Media|year=2007|isbn=978-1-4020-6036-6|pages=[https://archive.org/details/descenthumansexr00bria/page/n17 1]–25|chapter=Physico-Theology and Mathematics (1710–1794)|url-access=limited}}</ref>

Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the [[sign test]], a simple [[non-parametric test]].<ref name="Conover1999">{{Citation|last=Conover|first=W.J.|title=Practical Nonparametric Statistics|pages=157–176|year=1999|chapter=Chapter 3.4: The Sign Test|edition=Third|publisher=Wiley|isbn=978-0-471-16068-7}}</ref><ref name="Sprent1989">{{Citation|last=Sprent|first=P.|title=Applied Nonparametric Statistical Methods|year=1989|edition=Second|publisher=Chapman & Hall|isbn=978-0-412-44980-2}}</ref><ref>{{cite book|last=Stigler|first=Stephen M.|title=The History of Statistics: The Measurement of Uncertainty Before 1900|publisher=Harvard University Press|year=1986|isbn=978-0-67440341-3|pages=[https://archive.org/details/historyofstatist00stig/page/225 225–226]}}</ref> In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5<sup>82</sup>, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the ''p''-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the ''p''&nbsp;=&nbsp;1/2<sup>82</sup> significance level.

Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls.<ref name="Laplace 1778">{{cite journal|last=Laplace|first=P.|year=1778|title=Mémoire sur les probabilités|url=https://portal.getty.edu/books/bnf_bd6t54192707f|journal=Mémoires de l'Académie Royale des Sciences de Paris|pages=227–332}} Reprinted in {{cite book|last=Laplace|first=P.|title=Oeuvres complètes de Laplace|volume=9|pages=383–488|chapter=Mémoire sur les probabilités (XIX, XX)|chapter-url=http://gallica.bnf.fr/ark:/12148/bpt6k77597p/f386|publisher=Gauthier-Villars|year=1878–1912}} English translation: 
{{cite web|last=Laplace|first=P.|title=Mémoire sur les probabilités|translator-first=Richard J.|translator-last=Pulskamp|date=August 21, 2010|url=http://cerebro.xu.edu/math/Sources/Laplace/memoir_probabilities.pdf
|archive-date=April 27, 2015|archive-url=https://web.archive.org/web/20150427142452/http://cerebro.xu.edu/math/Sources/Laplace/memoir_probabilities.pdf|url-status=dead}}</ref> He concluded by calculation of a ''p''-value that the excess was a real, but unexplained, effect.<ref>{{cite book|last=Stigler|first=Stephen M.|url=https://archive.org/details/historyofstatist00stig/page/134|title=The History of Statistics: The Measurement of Uncertainty before 1900|publisher=Belknap Press of Harvard University Press|year=1986|isbn=978-0-674-40340-6|location=Cambridge, Mass|page=[https://archive.org/details/historyofstatist00stig/page/134 134]}}</ref>

===Lady tasting tea===
{{main|Lady tasting tea}}

In a famous example of hypothesis testing, known as the ''Lady tasting tea'',<ref name="fisher">{{cite book|last=Fisher|first=Sir Ronald A.|title=The World of Mathematics, volume 3|publisher=Courier Dover Publications|year=2000|isbn=978-0-486-41151-4|editor=James Roy Newman|trans-title=Design of Experiments|chapter=Mathematics of a Lady Tasting Tea|author-link=Ronald Fisher|orig-year=1935|chapter-url=https://books.google.com/books?id=oKZwtLQTmNAC&q=%22mathematics+of+a+lady+tasting+tea%22&pg=PA1512}} Originally from Fisher's book ''Design of Experiments''.</ref> Dr. [[Muriel Bristol]], a colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask how probable it was that she would get as many cups right as she did purely by chance. The null hypothesis was that the lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes out of 4 possible, based on a conventional probability criterion (<&nbsp;5%). A pattern of 4 successes corresponds to 1 out of 70 possible combinations (''p''&nbsp;≈&nbsp;1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup,<ref>{{cite book|last=Box|first=Joan Fisher|title=R.A. Fisher, The Life of a Scientist|publisher=Wiley|year=1978|isbn=978-0-471-09300-8|location=New York|page=134}}</ref> which would be considered a statistically significant result.
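
As a rough illustration of why only the all-correct outcome falls in the critical region, the chance distribution of the count can be written out. The sketch assumes that the lady selects 4 of the 8 cups and that, under the null hypothesis, each of the 70 possible selections is equally likely; the symbol ''X'' for the number of correctly selected cups is introduced here only for illustration.

:<math>P(X = k) = \frac{\binom{4}{k}\binom{4}{4-k}}{\binom{8}{4}}, \qquad P(X = 4) = \frac{1}{70} \approx 0.014, \qquad P(X \ge 3) = \frac{1 + 16}{70} \approx 0.243</math>

Under these assumptions, admitting 3 successes to the critical region would raise the probability of a chance rejection to about 24%, well above the 5% criterion.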

===Courtroom trial===
A statistical test procedure is comparable to a criminal [[trial (law)|trial]]; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough evidence for the prosecution is the defendant convicted.

At the start of the procedure, there are two hypotheses: <math>H_0</math>, "the defendant is not guilty", and <math>H_1</math>, "the defendant is guilty". The first one, <math>H_0</math>, is called the ''[[null hypothesis]]''. The second one, <math>H_1</math>, is called the ''alternative hypothesis''. It is the alternative hypothesis that one hopes to support.

The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant. Such an error is called an ''[[error of the first kind]]'' (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an ''[[error of the second kind]]'' (acquitting a person who committed the crime) is more common.

{|class="wikitable"
|
! H<sub>0</sub> is true <br /> Truly not guilty
! H<sub>1</sub> is true <br /> Truly guilty
|- align="center"
! Do not reject the null hypothesis <br /> Acquittal
| {{success|Right decision}}
| {{failure|Wrong decision}} <br /> Type II error
|- align="center"
! Reject the null hypothesis <br /> Conviction
| {{failure|Wrong decision}} <br /> Type I error
| {{success|Right decision}}
|}
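
In symbols, the two kinds of error in the table correspond to two conditional probabilities. The notation below is the conventional one and is an assumption here: ''α'' for the probability of a Type I error and ''β'' for the probability of a Type II error.

:<math>\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}), \qquad \beta = P(\text{do not reject } H_0 \mid H_1 \text{ is true})</math>

In the courtroom analogy, a demanding standard of proof keeps ''α'' small at the cost of a larger ''β''.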

A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.

===Clairvoyant card game===
A person (the subject) is tested for [[clairvoyance]]. They are shown the back face of a randomly chosen playing card 25 times and asked which of the four [[Suit (cards)|suits]] it belongs to. The number of hits, or correct answers, is called ''X''.

As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant.<ref>{{cite book|last1=Jaynes|first1=E. T.|title=Probability theory : the logic of science|date=2007|publisher=Cambridge Univ. Press|isbn=978-0-521-59271-0|edition=5. print.|location=Cambridge [u.a.]}}</ref> The alternative is: the person is (more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test subject can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly ''p''. The hypotheses, then, are:
* null hypothesis: <math>H_0: p = \tfrac 14</math> (just guessing)
and
* alternative hypothesis: <math>H_1: p > \tfrac 14</math> (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider them clairvoyant and reject the null hypothesis. The same holds for 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number, ''c'', of hits at or above which we consider the subject to be clairvoyant? How do we determine the critical value ''c''? With the choice ''c''&nbsp;=&nbsp;25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we are more critical than with ''c''&nbsp;=&nbsp;10. In the first case almost no test subjects will be recognized as clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical to be, that is, how often one is willing to accept an error of the first kind: a [[false positive]], or Type I error. With ''c'' = 25 the probability of such an error is:

:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X = 25\mid p=\frac 14\right)=\left(\frac 14\right)^{25}\approx10^{-15}</math>,}}

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.

Being less critical, with ''c'' = 10, gives:

:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge 10 \mid p=\frac 14\right) = \sum_{k=10}^{25}P\left(X=k\mid p=\frac 14\right) = \sum_{k=10}^{25} \binom{25}{k}\left( 1- \frac 14\right)^{25-k} \left(\frac 14\right)^k \approx 0.0713</math>.}}

Thus, ''c'' = 10 yields a much greater probability of a false positive.

Before the test is actually performed, the maximum acceptable probability of a Type I error (''α'') is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value ''c'' is calculated. For example, if we select an error rate of 1%, ''c'' is calculated thus:

:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge c\mid p=\frac 14\right) \le 0.01</math>.}}

Among all the numbers ''c'' with this property, we choose the smallest, in order to minimize the probability of a Type II error, a [[false negative]]. For the above example, we select: <math>c=13</math>.
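
The choice of ''c'' = 13 can be checked by evaluating the tail probability for neighbouring values. The figures below are rounded and assume the same binomial model as above (25 independent guesses, each correct with probability 1/4 under the null hypothesis).

:<math>P\left(X \ge 12 \mid p=\tfrac 14\right) \approx 0.0107 > 0.01, \qquad P\left(X \ge 13 \mid p=\tfrac 14\right) \approx 0.0034 \le 0.01</math>

So 12 hits would not be enough at the 1% level, while 13 or more would lead to rejection of the null hypothesis.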
<!--
But what if the subject did not guess any cards at all? Having zero correct answers is clearly an oddity too. Without any clairvoyant skills the probability.

:<math>P(X=0 \mid H_0 \text{ is valid}) = P\left(X = 0\mid p=\frac 14\right) = \left(1-\frac 14\right)^{25} \approx 0.00075</math>.

This is highly unlikely (less than 1 in a 1000 chance). While the subject can't guess the cards correctly, dismissing H<sub>0</sub> in favour of H<sub>1</sub> would be an error. In fact, the result would suggest a trait on the subject's part of avoiding calling the correct card. A test of this could be formulated: for a selected 1% error rate the subject would have to answer correctly at least twice, for us to believe that card calling is based purely on guessing. -->