===Two cells===
In the special case where there are only two cells in the table, the observed counts follow a [[binomial distribution]],
<math display="block"> O \ \sim \ \mathrm{Bin}(n,p), \, </math>
where
*''p'' = the probability, under the null hypothesis, of an observation falling in the first cell,
*''n'' = the number of observations in the sample.
In the above example the hypothesised probability of a male observation is 0.5, with 100 samples, so we expect to observe 50 males.

If ''n'' is sufficiently large, the above binomial distribution may be approximated by a Gaussian (normal) distribution, and thus the Pearson test statistic approximates a chi-squared distribution:
<math display="block"> \text{Bin}(n,p) \approx \text{N}(np, np(1-p)). \, </math>

Let ''O''<sub>1</sub> be the number of observations from the sample that are in the first cell. The Pearson test statistic can be expressed as
<math display="block"> \frac{(O_1-np)^2}{np} + \frac{(n-O_1-n(1-p))^2}{n(1-p)}, </math>
which can in turn be expressed as
<math display="block"> \left(\frac{O_1-np}{\sqrt{np(1-p)}}\right)^2. </math>
By the normal approximation to a binomial, this is the square of one standard normal variate and hence is distributed as chi-squared with 1 degree of freedom. Note that the denominator is one standard deviation of the Gaussian approximation, so the statistic can be written
<math display="block"> \frac{{\left(O_1 - \mu\right)}^2}{\sigma^2}. </math>
Consistent with the meaning of the chi-squared distribution, we are measuring how probable the observed number of standard deviations away from the mean is under the Gaussian approximation (which is a good approximation for large ''n''). The chi-squared distribution is then integrated to the right of the statistic value to obtain the [[P-value]], which is equal to the probability of getting a statistic equal to or larger than the observed one, assuming the null hypothesis.
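A minimal numerical sketch of this two-cell case in Python, using [[SciPy]] (the observed count of 45 males is an arbitrary illustrative value, not data from the example above):

<syntaxhighlight lang="python">
import math
from scipy.stats import chi2

# Hypothetical two-cell data: 45 males observed out of n = 100,
# with hypothesised probability p = 0.5 for the first cell.
n, p = 100, 0.5
o1 = 45                      # observed count in the first cell
o2 = n - o1                  # observed count in the second cell
e1, e2 = n * p, n * (1 - p)  # expected counts under the null hypothesis

# Two-term Pearson statistic: sum of (observed - expected)^2 / expected.
stat = (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2

# Equivalent single-square form: one standard normal variate, squared.
z = (o1 - n * p) / math.sqrt(n * p * (1 - p))
assert math.isclose(stat, z ** 2)

# P-value: upper tail of the chi-squared distribution with 1 degree of freedom.
p_value = chi2.sf(stat, df=1)
print(stat, p_value)  # 1.0 0.3173...
</syntaxhighlight>

With these counts the statistic is 1.0, i.e. the observed count lies one standard deviation from its expectation, giving a p-value of about 0.317.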
===Two-by-two contingency tables===
When the test is applied to a [[contingency table]] containing two rows and two columns, the test is equivalent to a [[Z-test]] of proportions.{{Citation needed|date=September 2018|reason=Claim needs a citation or more precise context -- we were unable to reproduce this using standard library functions in Python or R}}

===Many cells===
Broadly similar arguments as above lead to the desired result, though the details are more involved. One may apply an orthogonal change of variables to turn the limiting summands in the test statistic into one fewer squares of i.i.d. standard normal random variables.<ref>{{cite arXiv |title=Seven Proofs of the Pearson Chi-Squared Independence Test and its Graphical Interpretation |date=3 September 2018 |last1=Benhamou |first1=Eric |last2=Melot |first2=Valentin |eprint=1808.09171 |class=math.ST |pages=5–6}}</ref>

{{hidden begin|title=Proof}}
Let us now prove that the distribution indeed approaches asymptotically the <math>\chi^2</math> distribution as the number of observations approaches infinity.

Let <math>n</math> be the number of observations, <math>m</math> the number of cells and <math>p_i</math> the probability of an observation falling in the ''i''-th cell, for <math>1\le i\le m</math>. We denote by <math>\{k_i\}</math> the configuration where for each ''i'' there are <math>k_i</math> observations in the ''i''-th cell. Note that
<math display="block">\sum_{i=1}^m k_i = n \qquad \text{and} \qquad \sum_{i=1}^m p_i = 1.</math>

Let <math>\chi^2_P(\{k_i\},\{p_i\})</math> be Pearson's cumulative test statistic for such a configuration, and let <math>\chi^2_P(\{p_i\})</math> be the distribution of this statistic. We will show that the latter distribution approaches the <math>\chi^2</math> distribution with <math>m-1</math> degrees of freedom as <math>n \to \infty.</math>

For any arbitrary value ''T'':
<math display="block"> P(\chi^2_P(\{p_i\}) > T) = \sum_{\{k_i \mid \chi^2_P(\{k_i\},\{p_i\}) > T\}} \frac{n!}{k_1! \cdots k_m!} \prod_{i=1}^m {p_i}^{k_i} </math>

We will use a procedure similar to the approximation in the [[de Moivre–Laplace theorem]]. Contributions from small <math>k_i</math> are of subleading order in <math>n</math>, and thus for large <math>n</math> we may use [[Stirling's formula]] for both <math>n!</math> and <math>k_i!</math> to get the following:
<math display="block">P(\chi^2_P(\{p_i\}) > T) \sim \sum_{\{k_i \mid \chi^2_P(\{k_i\},\{p_i\}) > T \}} \prod_{i=1}^m \left(\frac{np_i}{k_i}\right)^{k_i} \sqrt{\frac{2\pi n}{\prod_{i=1}^m 2\pi k_i}}</math>

Substituting
<math display="block">x_i = \frac{k_i-np_i}{\sqrt{n}}, \qquad i = 1, \cdots, m-1, </math>
we may approximate for large <math>n</math> the sum over the <math>k_i</math> by an integral over the <math>x_i</math>. Noting that
<math display="block">k_m = np_m-\sqrt{n} \sum_{i=1}^{m-1}x_i,</math>
we arrive at
<math display="block"> \begin{align} P(\chi^2_P (\{p_i\}) > T) &\sim \sqrt{\frac{2\pi n}{\prod_{i=1}^m 2\pi k_i}} \int_\Omega \left[ \prod_{i=1}^{m-1} \sqrt{n}\, dx_i \right] \times \\ &\qquad \qquad \times \left[\prod_{i=1}^{m-1} \left(1+\frac{x_i}{\sqrt{n} p_i}\right)^{-(n p_i + \sqrt{n} x_i)}\right] \left(1-\frac{\sum_{i=1}^{m-1}{x_i}}{\sqrt{n} p_m}\right)^{-\left(n p_m-\sqrt{n} \sum_{i=1}^{m-1}x_i\right)} \\[1ex] &= \sqrt{\frac{2\pi n}{\prod_{i=1}^m \left(2\pi n p_i + 2\pi \sqrt{n} x_i\right)}} \int_\Omega \left[ \prod_{i=1}^{m-1} \sqrt{n}\, dx_i\right] \times \\ &\qquad \qquad \times \left[ \prod_{i=1}^{m-1} \exp\left(-\left(n p_i + \sqrt{n} x_i \right) \ln \left(1+\frac{x_i}{\sqrt{n} p_i}\right)\right)\right] \exp \left( -\left(n p_m-\sqrt{n} \sum_{i=1}^{m-1}x_i\right) \ln \left(1-\frac{\sum_{i=1}^{m-1}{x_i}}{\sqrt{n}p_m}\right) \right) \end{align}</math>
where <math>\Omega</math> is the set defined through <math>\chi^2_P(\{k_i\},\{p_i\}) = \chi^2_P(\{\sqrt{n} x_i+n p_i\},\{p_i\}) > T</math>.{{clarify|What exactly is the structure of the set?|date=April 2025}}

By [[Taylor expansion|expanding]] the logarithm and taking the leading terms in <math>n</math>, we get
<math display="block"> P(\chi^2_P(\{p_i\}) > T) \sim \frac{1}{\sqrt{(2\pi)^{m-1} \prod_{i=1}^{m} p_i}} \int_\Omega \left[ \prod_{i=1}^{m-1} dx_i\right] \exp\left[-\frac{1}{2}\sum_{i=1}^{m-1}\frac{x_i^2}{p_i} -\frac{1}{2p_m}\left(\sum_{i=1}^{m-1}{x_i} \right)^2 \right]</math>

Pearson's chi-squared statistic, <math>\chi^2_P(\{k_i\},\{p_i\}) = \chi^2_P(\{\sqrt{n} x_i+n p_i\},\{p_i\})</math>, is precisely the argument of the exponent, apart from the factor of <math>-\tfrac{1}{2}</math> (note that the final term in the argument of the exponent is equal to <math>(k_m-n p_m)^2/(n p_m)</math>). This argument can be written as
<math display="block">-\frac{1}{2}\sum_{i,j=1}^{m-1}x_i A_{ij} x_j, \qquad A_{ij} = \tfrac{\delta_{ij}}{p_i} + \tfrac{1}{p_m}, \quad i,j = 1, \cdots, m-1.</math>

<math>A</math> is a regular (invertible) symmetric <math>(m-1) \times (m-1)</math> matrix, and hence [[diagonalizable]].
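The relation between Pearson's statistic and the quadratic form in <math>A</math> is in fact an exact algebraic identity for any configuration, not merely an asymptotic one. A short Python sketch (with arbitrary illustrative probabilities and counts) checking this identity, and the fact that <math>A</math> has positive eigenvalues:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative cell probabilities (m = 4 cells) and sample size.
p = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000

# One configuration {k_i} with sum(k) = n.
k = rng.multinomial(n, p)

# Pearson's cumulative test statistic over all m cells.
chi2_pearson = np.sum((k - n * p) ** 2 / (n * p))

# Quadratic form over the first m-1 variables x_i = (k_i - n p_i) / sqrt(n),
# with A_ij = delta_ij / p_i + 1 / p_m.
x = (k[:-1] - n * p[:-1]) / np.sqrt(n)
A = np.diag(1 / p[:-1]) + 1 / p[-1]
assert np.isclose(chi2_pearson, x @ A @ x)

# A is symmetric with positive eigenvalues, hence diagonalizable by an
# orthogonal change of variables that turns the form into a sum of squares.
assert np.all(np.linalg.eigvalsh(A) > 0)
</syntaxhighlight>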
It is therefore possible to make a linear change of variables in <math>\{x_i\}</math> so as to obtain <math>m-1</math> new variables <math>\{y_i\}</math> such that
<math display="block">\sum_{i,j=1}^{m-1}x_i A_{ij} x_j = \sum_{i=1}^{m-1}y_i^2.</math>
This linear change of variables merely multiplies the integral by a constant [[Jacobian matrix and determinant|Jacobian]], so we get
<math display="block">P(\chi^2_P(\{p_i\}) > T) \sim C \int_{\sum_{i=1}^{m-1} y_i^2 > T} \left[\prod_{i=1}^{m-1} dy_i \right] \exp\left[-\frac{1}{2}\sum_{i=1}^{m-1} y_i^2 \right],</math>
where ''C'' is a constant. This is the probability that the sum of squares of <math>m-1</math> independent standard normal variables (zero mean, unit variance) exceeds ''T'', namely that a <math>\chi^2</math> variable with <math>m-1</math> degrees of freedom is larger than ''T''. We have thus shown that in the limit <math>n \to \infty,</math> the distribution of Pearson's statistic approaches the chi-squared distribution with <math>m-1</math> degrees of freedom.
{{hidden end}}

An alternative derivation is on the [[Multinomial distribution#Large deviation theory|multinomial distribution page]].
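The limit just proved can be illustrated numerically: simulate many multinomial configurations and compare the empirical tail of Pearson's statistic with the chi-squared tail. A Python sketch using [[NumPy]] and [[SciPy]] (the probabilities, sample size, and threshold are arbitrary illustrative values):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Arbitrary illustrative setup: m = 5 cells, n = 2000 observations per sample.
p = np.array([0.1, 0.15, 0.2, 0.25, 0.3])
m, n = len(p), 2000

# Draw many multinomial configurations and compute Pearson's statistic for each.
k = rng.multinomial(n, p, size=100_000)           # shape (100000, m)
stats = np.sum((k - n * p) ** 2 / (n * p), axis=1)

# Compare the empirical tail P(chi^2_P > T) with the asymptotic chi-squared
# tail with m - 1 degrees of freedom, at an arbitrary threshold T.
T = 9.0
print(np.mean(stats > T))    # empirical tail, close to...
print(chi2.sf(T, df=m - 1))  # ...the chi-squared tail, about 0.061
</syntaxhighlight>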