=== Pearson's chi-squared test ===
{{See also|Pearson's chi-squared test}}

In 1900, Pearson published a paper<ref name = Pearson1900 /> on the {{math|χ<sup>2</sup>}} test that is considered one of the foundations of modern statistics.<ref name = Cochran1952>
{{cite journal
 | last = Cochran | first = William G.
 | author-link = William G. Cochran
 | title = The Chi-square Test of Goodness of Fit
 | journal = The Annals of Mathematical Statistics
 | volume = 23
 | issue = 3
 | year = 1952
 | pages = 315–345
 | jstor = 2236678
 | doi=10.1214/aoms/1177729380
 | doi-access = free
 }}
</ref> In this paper, Pearson investigated a test of goodness of fit.

Suppose that {{mvar|n}} observations in a random sample from a population are classified into {{mvar|k}} mutually exclusive classes, with {{mvar|x<sub>i</sub>}} observations falling into the {{mvar|i}}th class (for {{math|''i'' {{=}} 1, 2, …, ''k''}}), and that a null hypothesis gives the probability {{mvar|p<sub>i</sub>}} that an observation falls into the {{mvar|i}}th class. The expected numbers are then {{math|''m<sub>i</sub>'' {{=}} ''np<sub>i</sub>''}} for all {{mvar|i}}, where

:<math>\begin{align}
& \sum^k_{i=1}{p_i} = 1 \\[8pt]
& \sum^k_{i=1}{m_i} = n\sum^k_{i=1}{p_i} = n  
\end{align}</math>

Pearson proposed that, if the null hypothesis is correct, the limiting distribution of the quantity given below as {{math|''n'' → ∞}} is the {{math|χ<sup>2</sup>}} distribution.

:<math>X^2=\sum^k_{i=1}\frac{(x_i-m_i)^2}{m_i}=\left(\sum^k_{i=1}\frac{x_i^2}{m_i}\right)-n</math>
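
The equality of the two forms can be checked by expanding the square and using {{math|Σ''x<sub>i</sub>'' {{=}} Σ''m<sub>i</sub>'' {{=}} ''n''}}, since every one of the {{mvar|n}} observations falls into exactly one class:

:<math>\sum^k_{i=1}\frac{(x_i-m_i)^2}{m_i}=\sum^k_{i=1}\frac{x_i^2}{m_i}-2\sum^k_{i=1}x_i+\sum^k_{i=1}m_i=\sum^k_{i=1}\frac{x_i^2}{m_i}-2n+n</math>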

Pearson dealt first with the case in which the expected numbers {{mvar|m<sub>i</sub>}} are known and are large enough in all cells that every observed number {{mvar|x<sub>i</sub>}} may be taken as [[normal distribution|normally distributed]], and reached the result that, in the limit as {{mvar|n}} becomes large, {{math|''X''{{isup|2}}}} follows the {{math|χ<sup>2</sup>}} distribution with {{math|''k'' − 1}} degrees of freedom.
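
As an illustration, the computation for a fully specified null hypothesis can be sketched in Python. This is a minimal sketch rather than a reference implementation: the function names and the die-roll counts are hypothetical, chosen only for this example, and the code assumes every {{mvar|p<sub>i</sub>}} is positive.

```python
from math import isclose


def pearson_statistic(observed, probabilities):
    """Return (X2, degrees of freedom) for a fully specified null hypothesis.

    observed      -- counts x_i in each of the k classes
    probabilities -- null-hypothesis probabilities p_i, which must sum to 1
    """
    if len(observed) != len(probabilities):
        raise ValueError("observed and probabilities must have the same length")
    if not isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    n = sum(observed)
    expected = [n * p for p in probabilities]  # m_i = n * p_i
    x2 = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
    return x2, len(observed) - 1


def pearson_statistic_shortcut(observed, probabilities):
    """Same statistic via the second form: sum(x_i^2 / m_i) - n."""
    n = sum(observed)
    return sum(x * x / (n * p) for x, p in zip(observed, probabilities)) - n


# Hypothetical data: 60 rolls of a die, tested against a fair-die null hypothesis.
counts = [8, 9, 12, 11, 6, 14]
fair = [1 / 6] * 6

x2, dof = pearson_statistic(counts, fair)
print(x2, dof)
print(isclose(x2, pearson_statistic_shortcut(counts, fair)))
```

For these made-up counts every expected number is 10, so {{math|''X''{{isup|2}}}} is about 4.2 on 5 degrees of freedom, and both forms of the statistic agree up to floating-point rounding.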

Pearson next considered the case in which the expected numbers depend on parameters that have to be estimated from the sample. Writing {{mvar|m<sub>i</sub>}} for the true expected numbers and {{math|''m''′<sub>''i''</sub>}} for the estimated expected numbers, he suggested that the difference

:<math>X^2-{X'}^2=\sum^k_{i=1}{\frac{x_i^2}{m_i}}-\sum^k_{i=1}{\frac{x_i^2}{m'_i}}</math>

will usually be positive and small enough to be omitted. Pearson concluded that if {{math|''X''′{{isup|2}}}} is also regarded as following the {{math|χ<sup>2</sup>}} distribution with {{math|''k'' − 1}} degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications and was not settled for 20 years, until Fisher's 1922 and 1924 papers.<ref name = Fisher1922>

{{cite journal
 | last = Fisher | first = Ronald A.
 | author-link = Ronald A. Fisher
 | title = On the Interpretation of {{math|χ<sup>2</sup>}} from Contingency Tables, and the Calculation of P
 | journal = Journal of the Royal Statistical Society
 | volume = 85
 | issue = 1
 | year = 1922
 | pages = 87–94
 | jstor = 2340521
 | doi=10.2307/2340521
 }}

</ref><ref name = Fisher1924>
{{cite journal
 | last = Fisher | first = Ronald A.
 | author-link = Ronald A. Fisher
 | title = The Conditions Under Which {{math|χ<sup>2</sup>}} Measures the Discrepancy Between Observation and Hypothesis
 | journal = Journal of the Royal Statistical Society
 | volume = 87
 | issue = 3
 | year = 1924
 | pages = 442–450
 | jstor = 2341149
}}</ref>