== History ==
In the 19th century, statistical analytical methods were mainly applied in biological data analysis, and it was customary for researchers such as [[Sir George Airy]] and [[Mansfield Merriman]] to assume that observations followed a [[normal distribution]]; their works were criticized by [[Karl Pearson]] in his 1900 paper.<ref name = Pearson1900>{{cite journal | last = Pearson | first = Karl | author-link = Karl Pearson | title = On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling | journal = Philosophical Magazine | series = Series 5 | volume = 50 | issue = 302 | year = 1900 | pages = 157–175 | url = https://www.tandfonline.com/doi/abs/10.1080/14786440009463897 | doi = 10.1080/14786440009463897 }}</ref>

At the end of the 19th century, Pearson noticed the existence of significant [[skewness]] within some biological observations. In order to model the observations regardless of whether they were normal or skewed, Pearson, in a series of articles published from 1893 to 1916,<ref name = Pearson1893>{{cite journal | last = Pearson | first = Karl | author-link = Karl Pearson | title = Contributions to the mathematical theory of evolution [abstract] | journal = Proceedings of the Royal Society | volume = 54 | year = 1893 | pages = 329–333 | jstor = 115538 | doi = 10.1098/rspl.1893.0079 | doi-access = free }}</ref><ref name = Pearson1895>{{cite journal | last = Pearson | first = Karl | author-link = Karl Pearson | title = Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material | journal = Philosophical Transactions of the Royal Society | volume = 186 | year = 1895 | pages = 343–414 | bibcode = 1895RSPTA.186..343P | jstor = 90649 | doi = 10.1098/rsta.1895.0010 | url = https://zenodo.org/record/1432104 | doi-access = free }}</ref><ref name = Pearson1901>{{cite journal | last = Pearson | first = Karl | author-link = Karl Pearson | title = Mathematical contributions to the theory of evolution, X: Supplement to a memoir on skew variation | journal = Philosophical Transactions of the Royal Society A | volume = 197 | issue = 287–299 | year = 1901 | pages = 443–459 | bibcode = 1901RSPTA.197..443P | jstor = 90841 | doi = 10.1098/rsta.1901.0023 }}</ref><ref name = Pearson1916>{{cite journal | last = Pearson | first = Karl | author-link = Karl Pearson | title = Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation | journal = Philosophical Transactions of the Royal Society A | volume = 216 | issue = 538–548 | year = 1916 | pages = 429–457 | bibcode = 1916RSPTA.216..429P | jstor = 91092 | doi = 10.1098/rsta.1916.0009 | doi-access = free }}</ref> devised the [[Pearson distribution]], a family of continuous [[probability distribution]]s that includes the normal distribution and many skewed distributions, and proposed a method of statistical analysis consisting of using the Pearson distribution to model the observations and performing a test of goodness of fit to determine how well the model actually fits the observations.

=== Pearson's chi-squared test ===
{{See also|Pearson's chi-squared test}}
In 1900, Pearson published a paper<ref name = Pearson1900 /> on the {{math|χ<sup>2</sup>}} test which is considered to be one of the foundations of modern statistics.<ref name = Cochran1952>{{cite journal | last = Cochran | first = William G. | author-link = William G. Cochran | title = The Chi-square Test of Goodness of Fit | journal = The Annals of Mathematical Statistics | volume = 23 | issue = 3 | year = 1952 | pages = 315–345 | jstor = 2236678 | doi = 10.1214/aoms/1177729380 | doi-access = free }}</ref> In this paper, Pearson investigated a test of goodness of fit.

Suppose that {{mvar|n}} observations in a random sample from a population are classified into {{mvar|k}} mutually exclusive classes with respective observed numbers of observations {{mvar|x<sub>i</sub>}} (for {{math|''i'' {{=}} 1,2,…,''k''}}), and a null hypothesis gives the probability {{mvar|p<sub>i</sub>}} that an observation falls into the {{mvar|i}}th class. So we have the expected numbers {{math|''m<sub>i</sub>'' {{=}} ''np<sub>i</sub>''}} for all {{mvar|i}}, where

:<math>\begin{align}
& \sum^k_{i=1}{p_i} = 1 \\[8pt]
& \sum^k_{i=1}{m_i} = n\sum^k_{i=1}{p_i} = n
\end{align}</math>

Pearson proposed that, under the circumstance of the null hypothesis being correct, as {{math|''n'' → ∞}} the limiting distribution of the quantity given below is the {{math|χ<sup>2</sup>}} distribution:

:<math>X^2=\sum^k_{i=1}{\frac{(x_i-m_i)^2}{m_i}}=\sum^k_{i=1}{\frac{x_i^2}{m_i}}-n</math>

Pearson dealt first with the case in which the expected numbers {{mvar|m<sub>i</sub>}} are sufficiently large known numbers in all cells, assuming every observation {{mvar|x<sub>i</sub>}} may be taken as [[normal distribution|normally distributed]], and reached the result that, in the limit as {{mvar|n}} becomes large, {{math|''X''{{isup|2}}}} follows the {{math|χ<sup>2</sup>}} distribution with {{math|''k'' − 1}} degrees of freedom.

However, Pearson next considered the case in which the expected numbers depended on parameters that had to be estimated from the sample, and suggested that, with {{mvar|m<sub>i</sub>}} being the true expected numbers and {{math|''m''′<sub>''i''</sub>}} being the estimated expected numbers, the difference

:<math>X^2-{X'}^2=\sum^k_{i=1}{\frac{x_i^2}{m_i}}-\sum^k_{i=1}{\frac{x_i^2}{m'_i}}</math>

will usually be positive and small enough to be omitted. In conclusion, Pearson argued that if we regarded {{math|''X''′{{isup|2}}}} as also distributed as the {{math|χ<sup>2</sup>}} distribution with {{math|''k'' − 1}} degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications, and the issue was not settled for 20 years, until Fisher's 1922 and 1924 papers.<ref name = Fisher1922>{{cite journal | last = Fisher | first = Ronald A. | author-link = Ronald A. Fisher | title = On the Interpretation of {{math|χ<sup>2</sup>}} from Contingency Tables, and the Calculation of P | journal = Journal of the Royal Statistical Society | volume = 85 | issue = 1 | year = 1922 | pages = 87–94 | jstor = 2340521 | doi = 10.2307/2340521 }}</ref><ref name = Fisher1924>{{cite journal | last = Fisher | first = Ronald A. | author-link = Ronald A. Fisher | title = The Conditions Under Which {{math|χ<sup>2</sup>}} Measures the Discrepancy Between Observation and Hypothesis | journal = Journal of the Royal Statistical Society | volume = 87 | issue = 3 | year = 1924 | pages = 442–450 | jstor = 2341149 }}</ref>
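The following is a minimal sketch, not taken from Pearson's paper, of how the statistic {{math|''X''{{isup|2}}}} above is computed and referred to the {{math|χ<sup>2</sup>}} distribution with {{math|''k'' − 1}} degrees of freedom; the observed counts and null probabilities are hypothetical, chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2

# Hypothetical data: n = 120 die rolls classified into k = 6 faces,
# with the null hypothesis of a fair die (p_i = 1/6 for every class).
observed = np.array([18, 23, 16, 21, 18, 24])  # x_i
probs = np.full(6, 1 / 6)                      # p_i under the null
expected = observed.sum() * probs              # m_i = n * p_i

# Pearson's statistic: X^2 = sum_i (x_i - m_i)^2 / m_i
x2 = np.sum((observed - expected) ** 2 / expected)

# Under the null, X^2 is asymptotically chi-squared with k - 1 degrees of freedom.
df = len(observed) - 1
p_value = chi2.sf(x2, df)

print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.3f}")
</syntaxhighlight>

The same computation is also available directly as <code>scipy.stats.chisquare</code>, which returns the statistic and its asymptotic p-value.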