
==History==
[[File:Arbuthnot John Kneller.jpg|thumb|upright|200px|alt=Chest high painted portrait of man wearing a brown robe and head covering|[[John Arbuthnot]]]]
[[File:Pierre-Simon-Laplace (1749-1827).jpg|thumb|upright|200px|[[Pierre-Simon Laplace]]]]
[[File:Journalofheredit07ameruoft 0448.jpg|thumb|200px|alt=Man seated at his desk looking up at the camera|[[Karl Pearson]]]]
[[File:Youngronaldfisher2.JPG|thumb|upright|200px|alt=Sepia toned photo of young man wearing a suit, a medal, and wire-rimmed eyeglasses|[[Ronald Fisher]] ]]

''P''-values were first computed in the 1700s for the [[human sex ratio]] at birth, and were used to assess statistical significance against the null hypothesis of equal probability of male and female births.<ref>{{cite book |title=The Descent of Human Sex Ratio at Birth |url=https://archive.org/details/descenthumansexr00bria |url-access=limited | vauthors = Brian E, Jaisson M |author-link1=Éric Brian |author-link2=Marie Jaisson |chapter=Physico-Theology and Mathematics (1710–1794) |pages=[https://archive.org/details/descenthumansexr00bria/page/n17 1]–25 |year=2007 |publisher=Springer Science & Business Media |isbn=978-1-4020-6036-6}}</ref> [[John Arbuthnot]] studied this question in 1710,<ref>{{cite journal| vauthors = Arbuthnot J |s2cid=186209819|title=An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes|journal=[[Philosophical Transactions of the Royal Society of London]] | volume=27| pages=186–190 | year=1710 | url = http://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf|doi=10.1098/rstl.1710.0011|issue=325–336|doi-access=free}}</ref><ref name="Conover1999">{{cite book | vauthors = Conover WJ |title=Practical Nonparametric Statistics |edition=Third |year=1999 |publisher=Wiley |isbn=978-0-471-16068-7 |pages=157–176 |chapter=Chapter 3.4: The Sign Test }}</ref><ref name="Sprent1989">{{cite book | vauthors = Sprent P |title=Applied Nonparametric Statistical Methods |edition=Second |year=1989 |publisher=Chapman & Hall
|isbn=978-0-412-44980-2 }}</ref><ref>{{cite book |title = The History of Statistics: The Measurement of Uncertainty Before 1900 | vauthors = Stigler SM |publisher=Harvard University Press |year=1986 |isbn=978-0-67440341-3 |pages=[https://archive.org/details/historyofstatist00stig/page/225 225–226]}}</ref> and examined birth records in London for each of the 82 years from 1629 to 1710. In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 1/2<sup>82</sup>, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the ''p''-value. This is vanishingly small, leading Arbuthnot to conclude that this was not due to chance, but to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the ''p''&nbsp;=&nbsp;1/2<sup>82</sup> significance level. This and other work by Arbuthnot is credited as "… the first use of significance tests …",<ref name="Bellhouse2001">{{cite book | vauthors = Bellhouse P |title = Statisticians of the Centuries |editor1-link=Chris Heyde |editor2-link=Eugene Seneta | veditors = Heyde CC, Seneta E |year=2001 |publisher=Springer |isbn=978-0-387-95329-8 |pages=39–42 |chapter=John Arbuthnot}}</ref> the first example of reasoning about statistical significance,<ref name="Hald1998">{{cite book | vauthors = Hald A |title=A History of Mathematical Statistics from 1750 to 1930 |year=1998 |publisher=Wiley |pages=65 |chapter=Chapter 4. Chance or Design: Tests of Significance
}}</ref> and "… perhaps the first published report of a [[non-parametric test|nonparametric test]] …",<ref name="Conover1999" /> specifically the [[sign test]]; see details at {{section link|Sign test|History}}.
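
Arbuthnot's figure can be reproduced with exact rational arithmetic. The following is a minimal sketch, not his own procedure: the variable names are illustrative, and it assumes the one-sided reading in which each year is treated as an independent fair coin flip.

```python
from fractions import Fraction

# Arbuthnot's data as described above: 82 years of London records,
# with more male than female births in every one of them.
YEARS = 82

# Under the null hypothesis each year is equally likely to show a male
# or a female excess, so the probability that all 82 years show a male
# excess is (1/2)**82.
p_value = Fraction(1, 2) ** YEARS

print(p_value.denominator)  # the exact value of 2**82
print(float(p_value))       # the same probability as a decimal
```

The exact denominator is 2<sup>82</sup>, which rounds to the "1 in 4,836,000,000,000,000,000,000,000" quoted above.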

The same question was later addressed by [[Pierre-Simon Laplace]], who instead used a ''parametric'' test, modeling the number of male births with a [[binomial distribution]]:<ref>{{cite book |title = The History of Statistics: The Measurement of Uncertainty Before 1900 | vauthors = Stigler SM |publisher=Harvard University Press |year=1986 |isbn=978-0-67440341-3 |page=[https://archive.org/details/historyofstatist00stig/page/134 134]}}</ref>
{{blockquote|In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a ''p''-value that the excess was a real, but unexplained, effect.}}

The ''p''-value was first formally introduced by [[Karl Pearson]] in [[Pearson's chi-squared test]],<ref name="Pearson1900">{{cite journal | vauthors = Pearson K | author-link = Karl Pearson | title = On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling | doi = 10.1080/14786440009463897 | journal = Philosophical Magazine |series=Series 5 | volume = 50 | issue = 302 | pages = 157–175 | year = 1900 |url = http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf }}</ref> using the [[chi-squared distribution]] and notated as capital P.<ref name="Pearson1900" /> The ''p''-values for the [[chi-squared distribution]] (for various values of χ<sup>2</sup> and degrees of freedom), now notated as ''P,'' were calculated in {{Harv|Elderton|1902}} and collected in {{Harv|Pearson|1914|pp=xxxi–xxxiii, 26–28|loc=Table XII}}.

[[Ronald Fisher]] formalized and popularized the use of the ''p''-value in statistics,<ref>{{Cite journal |last1=Biau |first1=David Jean |last2=Jolles |first2=Brigitte M. |last3=Porcher |first3=Raphaël |date=2010 |title=P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers |journal=Clinical Orthopaedics and Related Research |volume=468 |issue=3 |pages=885–892 |doi=10.1007/s11999-009-1164-4 |issn=0009-921X |pmc=2816758 |pmid=19921345}}</ref><ref>{{Cite journal |last=Brereton |first=Richard G. |date=2021 |title=P values and multivariate distributions: Non-orthogonal terms in regression models |url=https://linkinghub.elsevier.com/retrieve/pii/S0169743921000320 |journal=Chemometrics and Intelligent Laboratory Systems |language=en |volume=210 |pages=104264 |doi=10.1016/j.chemolab.2021.104264}}</ref> and it played a central role in his approach to the subject.<ref>{{citation | vauthors = Hubbard R, Bayarri MJ |title=Confusion Over Measures of Evidence (''p''′s) Versus Errors (α′s) in Classical Statistical Testing |journal=The American Statistician |volume=57 |year=2003 |issue=3 |pages=171–178 [p. 171] |doi=10.1198/0003130031856 |s2cid=55671953 }}</ref> In his highly influential book ''[[Statistical Methods for Research Workers]]'' (1925), Fisher proposed the level ''p'' = 0.05, or a 1 in 20 probability of being exceeded by chance, as a limit for [[statistical significance]], and applied this to a normal distribution (as a two-tailed test), thus yielding the rule of two standard deviations (on a normal distribution) for statistical significance (see [[68–95–99.7 rule]]).{{sfn|Fisher|1925|p=47|loc=Chapter [http://psychclassics.yorku.ca/Fisher/Methods/chap3.htm III. Distributions]}}{{NoteTag| 1 = To be more specific, ''p'' = 0.05 corresponds to about 1.96 standard deviations for a normal distribution (two-tailed test), and 2 standard deviations correspond to about a 1 in 22 probability of being exceeded by chance, or ''p'' ≈ 0.045; Fisher notes these approximations.}}{{sfn|Dallal|2012|loc=Note 31: [http://www.jerrydallal.com/LHSP/p05.htm Why P=0.05?]}}
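
The correspondence between ''p'' = 0.05 and roughly two standard deviations can be checked numerically. This is an illustrative sketch using Python's standard-library <code>statistics.NormalDist</code>; the helper name <code>two_tailed_p</code> is an assumption made for this example and does not come from Fisher.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def two_tailed_p(z: float) -> float:
    """Probability that a standard normal deviate exceeds |z| in either tail."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))

# Number of standard deviations cut off by p = 0.05 (two-tailed):
# the 97.5th percentile of the standard normal distribution.
z_crit = std_normal.inv_cdf(1 - 0.05 / 2)

print(round(z_crit, 2))              # about 1.96
print(round(two_tailed_p(2.0), 4))   # about 0.0455
print(round(1 / two_tailed_p(2.0)))  # about 1 chance in 22
```

The results agree with the approximations given above: about 1.96 standard deviations for ''p'' = 0.05, and about a 1 in 22 probability of exceeding 2 standard deviations.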

He then computed a table of values similar to Elderton's but, importantly, with the roles of χ<sup>2</sup> and ''p'' reversed. That is, rather than computing ''p'' for different values of χ<sup>2</sup> (and degrees of freedom ''n''), he computed values of χ<sup>2</sup> that yield specified ''p''-values, specifically 0.99, 0.98, 0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.02, and 0.01.{{sfn|Fisher|1925|pp=78–79, 98|loc=Chapter [http://psychclassics.yorku.ca/Fisher/Methods/chap4.htm IV. Tests of Goodness of Fit, Independence and Homogeneity; with Table of χ<sup>2</sup>], [http://psychclassics.yorku.ca/Fisher/Methods/tabIII.gif Table III. Table of χ<sup>2</sup>]}} That allowed computed values of χ<sup>2</sup> to be compared against cutoffs and encouraged the use of ''p''-values (especially 0.05, 0.02, and 0.01) as cutoffs, instead of computing and reporting ''p''-values themselves. Tables of the same type were then compiled in {{Harv|Fisher|Yates|1938}}, which cemented the approach.{{sfn|Dallal|2012|loc=Note 31: [http://www.jerrydallal.com/LHSP/p05.htm Why P=0.05?]}}
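
The two directions of tabulation can be illustrated with a short sketch. It is not the method Elderton or Fisher used: the function names are hypothetical, and for simplicity it is restricted to even degrees of freedom, where the upper-tail probability of the chi-squared distribution has a closed form. The first function goes from χ<sup>2</sup> to ''p''; the second inverts it by bisection to go from ''p'' to χ<sup>2</sup>.

```python
import math

def chi2_sf_even(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for chi-squared with even df.

    For df = 2k the tail is exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def chi2_critical_even(p: float, df: int) -> float:
    """Value of chi-squared whose upper-tail probability equals p (0 < p < 1)."""
    lo, hi = 0.0, 1.0
    while chi2_sf_even(hi, df) > p:  # widen the bracket until the tail drops below p
        hi *= 2.0
    for _ in range(200):             # bisection on the decreasing tail function
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# From chi-squared to p:
print(chi2_sf_even(5.991, 2))
# From p to chi-squared:
for p in (0.05, 0.02, 0.01):
    print(p, round(chi2_critical_even(p, 2), 3))
```

A table built the second way needs only one column per chosen ''p''-value, which is what makes comparison against fixed cutoffs convenient.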

As an illustration of the application of ''p''-values to the design and interpretation of experiments, in his next book, ''[[The Design of Experiments]]'' (1935), Fisher presented the [[lady tasting tea]] experiment,{{sfn|Fisher|1971|loc=II. The Principles of Experimentation, Illustrated by a Psycho-physical Experiment}} which is the archetypal example of the ''p''-value.

To evaluate a lady's claim that she ([[Muriel Bristol]]) could distinguish by taste how tea is prepared (first adding the milk to the cup, then the tea, or first tea, then milk), she was sequentially presented with 8 cups: 4 prepared one way, 4 prepared the other, and asked to determine the preparation of each cup (knowing that there were 4 of each). In that case, the null hypothesis was that she had no special ability, the test was [[Fisher's exact test]], and the ''p''-value was <math>1/\binom{8}{4} = 1/70 \approx 0.014,</math> so Fisher was willing to reject the null hypothesis (consider the outcome highly unlikely to be due to chance) if all were classified correctly. (In the actual experiment, Bristol correctly classified all 8 cups.)

Fisher reiterated the ''p'' = 0.05 threshold and explained its rationale, stating:{{sfn|Fisher|1971|loc=Section 7. The Test of Significance}}
{{blockquote
|It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.}}
He also applied this threshold to the design of experiments, noting that had only 6 cups been presented (3 of each), a perfect classification would have yielded a ''p''-value of only <math>1/\binom{6}{3} = 1/20 = 0.05,</math> which would not have met this level of significance.{{sfn|Fisher|1971|loc=Section 7. The Test of Significance}} Fisher also underlined the interpretation of ''p'' as the long-run proportion of values at least as extreme as the data, assuming the null hypothesis is true.
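
The two ''p''-values above, for 8 cups and for 6 cups, follow from counting the equally likely ways of choosing which cups are labelled as one preparation. A minimal sketch follows; the function name <code>p_all_correct</code> is illustrative and not taken from Fisher.

```python
from fractions import Fraction
from math import comb

def p_all_correct(cups_per_kind: int) -> Fraction:
    """p-value for a perfect classification with equal numbers of each preparation.

    Under the null hypothesis of no special ability, every way of picking
    which half of the cups to call "milk first" is equally likely, and
    exactly one of those ways is entirely correct.
    """
    return Fraction(1, comb(2 * cups_per_kind, cups_per_kind))

for k in (4, 3):  # 8 cups (4 of each), then 6 cups (3 of each)
    p = p_all_correct(k)
    print(2 * k, "cups:", p, "=", round(float(p), 3))
```

This gives 1/70 for 8 cups and 1/20 for 6 cups, matching the binomial coefficients in the text.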

In later editions, Fisher explicitly contrasted the use of the ''p''-value for statistical inference in science with the Neyman–Pearson method, which he termed "Acceptance Procedures".{{sfn|Fisher|1971|loc=Section 12.1 Scientific Inference and Acceptance Procedures}} Fisher emphasized that while fixed levels such as 5%, 2%, and 1% are convenient, the exact ''p''-value can be used, and the strength of evidence can and will be revised with further experimentation. In contrast, decision procedures require a clear-cut decision, yielding an irreversible action, and the procedure is based on costs of error, which, he argued, are inapplicable to scientific research.