==Example chi-squared test for categorical data==

Suppose there is a city of 1,000,000 residents with four neighborhoods: {{math|''A''}}, {{math|''B''}}, {{math|''C''}}, and {{math|''D''}}. A random sample of 650 residents of the city is taken, and each resident's occupation is recorded as [[Collar workers|"white collar", "blue collar", or "no collar"]]. The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:

:{| class="wikitable" style="text-align: right;"
|-
! !! {{math|''A''}} !! {{math|''B''}} !! {{math|''C''}} !! {{math|''D''}} !! Total
|-
|style="text-align: left;"| White collar || 90 || 60 || 104 || 95 || 349 
|-
|style="text-align: left;"| Blue collar || 30 || 50 || 51 || 20 || 151
|-
|style="text-align: left;"| No collar || 30 || 40 || 45 || 35 || 150
|-
!style="text-align: left;"| Total || 150 || 150 || 200 || 150 || 650
|}

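Let us take the proportion of the sample living in neighborhood {{math|''A''}}, {{sfrac|150|650}}, to estimate what proportion of the whole 1,000,000 live in neighborhood {{math|''A''}}. Similarly we take {{sfrac|349|650}} to estimate what proportion of the 1,000,000 are white-collar workers. Under the null hypothesis of independence, the proportion of residents who are both white-collar workers and live in neighborhood {{math|''A''}} is the product of these two proportions, so among the 650 people sampled we should "expect" the number of white-collar workers in neighborhood {{math|''A''}} to be

: <math> 650\times\frac{150}{650}\times\frac{349}{650} = 150\times\frac{349}{650} \approx 80.54 </math>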
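The same calculation can be repeated for every cell of the table. The following is a minimal Python sketch, not part of any particular statistics package; the variable names (<code>observed</code>, <code>row_totals</code>, <code>col_totals</code>, <code>expected</code>) are chosen here for illustration only.

```python
# Observed counts from the table above.
# Keys are occupations; list positions are neighborhoods A, B, C, D.
observed = {
    "White collar": [90, 60, 104, 95],
    "Blue collar": [30, 50, 51, 20],
    "No collar": [30, 40, 45, 35],
}

# Marginal totals and overall sample size.
row_totals = {job: sum(counts) for job, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Under independence: expected count = row total * column total / sample size.
expected = {
    job: [row_totals[job] * c / n for c in col_totals]
    for job in observed
}

for job, values in expected.items():
    print(job, [round(v, 2) for v in values])
```

The first entry of the white-collar row reproduces the value of about 80.54 derived above.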

Then in that "cell" of the table, we have

: <math>\frac{\left(\text{observed}-\text{expected}\right)^2}{\text{expected}} = \frac{\left(90-80.54\right)^2}{80.54} \approx 1.11</math>

The sum of these quantities over all of the cells is the test statistic; in this case, <math> \approx 24.57 </math>. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is

: <math> (\text{number of rows}-1)(\text{number of columns}-1) = (3-1)(4-1) = 6 </math>

If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.
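As an illustration, the whole test can be sketched in a few lines of Python using only the standard library. The helper <code>chi2_sf_even_dof</code> is a hypothetical name introduced here; it relies on the closed form of the chi-squared upper-tail probability that holds only when the number of degrees of freedom is even, which happens to be the case in this example. The significance level of 0.05 is likewise an assumption made for the sketch, not something fixed by the test.

```python
import math

# Observed counts; rows are occupations, columns are neighborhoods A, B, C, D.
observed = [
    [90, 60, 104, 95],  # white collar
    [30, 50, 51, 20],   # blue collar
    [30, 40, 45, 35],   # no collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells.
statistic = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        statistic += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_sf_even_dof(x, k):
    """Upper-tail probability P(X > x) for chi-squared with even k.

    Hypothetical helper: uses the closed form
    exp(-x/2) * sum_{m < k/2} (x/2)^m / m!, valid only for even k.
    """
    if k <= 0 or k % 2:
        raise ValueError("this closed form needs a positive even k")
    half = x / 2
    return math.exp(-half) * sum(
        half ** m / math.factorial(m) for m in range(k // 2)
    )


p_value = chi2_sf_even_dof(statistic, dof)
alpha = 0.05  # assumed significance level, not part of the example above

print(f"statistic = {statistic:.2f}, dof = {dof}, p = {p_value:.5f}")
print("reject independence" if p_value < alpha else "do not reject independence")
```

With these data the upper-tail probability is on the order of 0.0004, so under the assumed 0.05 level the hypothesis of independence would be rejected. For an odd number of degrees of freedom a general-purpose routine would be needed instead of this closed form.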

A related issue is a test of homogeneity. Suppose that instead of giving every resident of the city an equal chance of inclusion in the sample, we decide in advance how many residents of each neighborhood to include. Then every resident of a given neighborhood has the same chance of being chosen, but residents of different neighborhoods have different probabilities of being chosen unless the four sample sizes are proportional to the populations of the four neighborhoods. In such a case, we would be testing "homogeneity" rather than "independence": the question is whether the proportions of blue-collar, white-collar, and no-collar workers are the same in all four neighborhoods. The test statistic and its degrees of freedom are nevertheless computed in the same way.
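The following sketch illustrates why the computation does not change. It treats the column totals of the table above as if they were sample sizes fixed in advance (an assumption made only for this illustration), estimates a common occupational distribution by pooling all neighborhoods, and checks that the resulting expected counts coincide with those obtained under independence. All variable names are hypothetical.

```python
# Observed counts; list positions are neighborhoods A, B, C, D.
observed = {
    "White collar": [90, 60, 104, 95],
    "Blue collar": [30, 50, 51, 20],
    "No collar": [30, 40, 45, 35],
}
neighborhoods = ["A", "B", "C", "D"]

# Treated here as sample sizes fixed in advance (assumption for illustration).
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Occupational proportions within each neighborhood's sample.
for j, name in enumerate(neighborhoods):
    shares = {job: counts[j] / col_totals[j] for job, counts in observed.items()}
    print(name, {job: round(s, 3) for job, s in shares.items()})

# Under homogeneity, every neighborhood shares one occupational distribution,
# estimated by pooling all four samples.
pooled = {job: sum(counts) / n for job, counts in observed.items()}

# Expected count = neighborhood sample size * pooled proportion.
expected_homogeneity = {
    job: [pooled[job] * c for c in col_totals] for job in observed
}

# Expected count under independence = row total * column total / n.
expected_independence = {
    job: [sum(counts) * c / n for c in col_totals]
    for job, counts in observed.items()
}

same = all(
    abs(h - i) < 1e-9
    for job in observed
    for h, i in zip(expected_homogeneity[job], expected_independence[job])
)
print("expected counts agree:", same)
```

Because the expected counts agree cell by cell, the sum of (observed − expected)<sup>2</sup> / expected is the same number under either sampling design.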