== Occurrence and applications ==

=== Confidence intervals for the difference in matched-pairs binary data (using multinomial with ''k=4'') ===

For matched-pairs binary data, a common task is to build a confidence interval for the difference between the proportions of the matched events. For example, we might have a test for some disease and apply it to the same population at two points in time (1 and 2), to check whether the proportion of positive results changed during that time.

Such scenarios can be represented using a two-by-two [[contingency table]] holding the number of elements with each combination of events. We can use lowercase ''f'' for sample frequencies: <math>f_{11}, f_{10}, f_{01}, f_{00}</math>, and capital ''F'' for population frequencies: <math>F_{11}, F_{10}, F_{01}, F_{00}</math>. These four combinations can be modeled as coming from a multinomial distribution (with four potential outcomes). Let the sizes of the sample and the population be ''n'' and ''N'', respectively. In such a case, the interest is in building a confidence interval for the difference of proportions from the marginals of the following (sampled) contingency table:

{| class="wikitable" style="text-align:center; margin:1em auto;"
|-
|         || Test 2 positive || Test 2 negative  || Row total
|-
| Test 1 positive || <math>f_{11}</math> || <math>f_{10}</math> || <math>f_{1*} = f_{11} + f_{10}</math>
|-
| Test 1 negative || <math>f_{01}</math> || <math>f_{00}</math> || <math>f_{0*} = f_{01} + f_{00}</math>
|-
| Column total || <math>f_{*1} = f_{11} + f_{01}</math> || <math>f_{*0} = f_{10} + f_{00}</math> || <math>n</math>
|}

Checking the difference in marginal proportions relies on the following definitions: <math>p_{1*} = \frac{F_{1*}}{N} = \frac{F_{11} + F_{10}}{N}</math>, <math>p_{*1} = \frac{F_{*1}}{N} = \frac{F_{11} + F_{01}}{N}</math>.
The difference for which we want to build a confidence interval is:

<math>p_{*1} - p_{1*} = \frac{F_{11} + F_{01}}{N} - \frac{F_{11} + F_{10}}{N} = \frac{F_{01}}{N} - \frac{F_{10}}{N} = p_{01} - p_{10}</math>

Hence, building a confidence interval for the difference of the marginal positive proportions (<math>p_{*1} - p_{1*}</math>) is the same as building a confidence interval for the difference of the proportions on the secondary diagonal of the two-by-two contingency table (<math>p_{01} - p_{10}</math>).
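The same identity holds for the sample frequencies, which can be checked numerically. The following is a minimal sketch using made-up counts; the function name and the example values are illustrative assumptions, not taken from any source.

```python
from fractions import Fraction

def marginal_and_offdiagonal_difference(f11, f10, f01, f00):
    """Return (marginal difference, off-diagonal difference) of sample proportions."""
    n = f11 + f10 + f01 + f00
    # Marginal proportions of positives for test 1 (rows) and test 2 (columns).
    p_row = Fraction(f11 + f10, n)
    p_col = Fraction(f11 + f01, n)
    marginal_diff = p_col - p_row
    # Difference of the two off-diagonal (discordant) cell proportions.
    offdiag_diff = Fraction(f01, n) - Fraction(f10, n)
    return marginal_diff, offdiag_diff

# Hypothetical table: f11=40, f10=10, f01=20, f00=30 (n=100).
marginal, offdiag = marginal_and_offdiagonal_difference(40, 10, 20, 30)
print(marginal, offdiag)  # prints: 1/10 1/10
```

Exact fractions are used here so that the two quantities compare equal without floating-point rounding.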

Calculating a [[p-value]] for such a difference is known as [[McNemar's test]]. A confidence interval for it can be constructed using the methods described above for [[Multinomial distribution#Confidence intervals for the difference of two proportions|Confidence intervals for the difference of two proportions]].
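As an illustration, the p-value can be sketched with the asymptotic chi-square form of McNemar's statistic (one degree of freedom, no continuity correction), which depends only on the discordant counts. This is one common variant rather than the only one; the function name and the example counts are hypothetical.

```python
import math

def mcnemar_chi2_p_value(f10, f01):
    """Asymptotic McNemar statistic and p-value (no continuity correction)."""
    discordant = f10 + f01
    if discordant == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    chi2 = (f01 - f10) ** 2 / discordant
    # Upper-tail probability of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical discordant counts: f10=10, f01=20.
chi2, p_value = mcnemar_chi2_p_value(10, 20)
print(chi2, p_value)
```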

The Wald confidence intervals from the previous section can be applied to this setting, and they appear in the literature under alternative notations. Specifically, the standard error (SE) is often presented in terms of the contingency table frequencies instead of the sample proportions. For example, the SE used in the Wald confidence intervals provided above can be written as:<ref name=pass_sample_size_software />{{rp|102–3}}

<math>
\widehat{\operatorname{SE}(p_{*1} - p_{1*})} = \widehat{\operatorname{SE}(p_{01} - p_{10})} = \frac{\sqrt{n(f_{10} + f_{01}) - (f_{10} - f_{01})^2}}{n\sqrt{n}}
</math>
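The formula above can be sketched in code as follows. This is a minimal illustration, assuming the usual Wald form (point estimate plus or minus a normal quantile times the SE); the function name, the default <math>\alpha</math>, and the example counts are hypothetical.

```python
import math
from statistics import NormalDist

def wald_matched_pairs_ci(f10, f01, n, alpha=0.05):
    """Wald estimate, SE and confidence interval for p_{*1} - p_{1*}."""
    # Point estimate: difference of the off-diagonal sample proportions.
    diff = (f01 - f10) / n
    # SE written in terms of the table frequencies.
    se = math.sqrt(n * (f10 + f01) - (f10 - f01) ** 2) / (n * math.sqrt(n))
    # Two-sided standard normal quantile z_{alpha/2}.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical counts: f10=10, f01=20, n=100.
diff, se, (lower, upper) = wald_matched_pairs_ci(10, 20, 100)
print(diff, se, lower, upper)
```

Only the discordant counts and the sample size enter the computation; the concordant cells affect the result solely through ''n''.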

Further research in the literature has identified several shortcomings in both the Wald and the Wald with continuity correction methods, and other methods have been proposed for practical application.<ref name=pass_sample_size_software />

One such modification is ''Agresti and Min’s Wald+2'' (similar to some of their other works<ref>{{Cite journal
 | last1 = Agresti
 | first1 = A.
 | last2 = Caffo
 | first2 = B.
 | title = Simple and effective confidence intervals for proportions and difference of proportions result from adding two successes and two failures
 | journal = The American Statistician
 | year = 2000
 | volume = 54
 | issue = 4
 | pages = 280–288
| doi = 10.1080/00031305.2000.10474560
 }}</ref>) in which <math>\frac{1}{2}</math> is added to each cell frequency.<ref name=Agresti2005/> This leads to the ''Wald+2'' confidence intervals. In a Bayesian interpretation, this is like building the estimators using as prior a [[Dirichlet distribution]] with all parameters equal to 0.5 (which is, in fact, the [[Jeffreys prior]]). The ''+2'' in the name ''Wald+2'' reflects that a two-by-two contingency table is a multinomial distribution with four possible events, so adding half an observation to each of them amounts to an overall addition of 2 observations (due to the prior).

This leads to the following modified SE for the case of matched pairs data:

<math>
\widehat{\operatorname{SE}(p_{*1} - p_{1*})}  = \frac{\sqrt{(n+2)(f_{10} + f_{01} + 1) - (f_{10} - f_{01})^2}}{(n+2)\sqrt{(n+2)}}
</math>

This can be plugged into the original Wald formula as follows:

<math>(\hat{p}_{*1} - \hat{p}_{1*})\frac{n}{n+2} \pm z_{\alpha/2} \cdot \widehat{\operatorname{SE}(p_{*1} - p_{1*})}_{\text{Wald+2}}</math>
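A minimal sketch of the ''Wald+2'' interval, following the adjusted SE and the shrunken point estimate given above, might look as follows. The function name, the default <math>\alpha</math>, and the example counts are hypothetical.

```python
import math
from statistics import NormalDist

def wald_plus2_matched_pairs_ci(f10, f01, n, alpha=0.05):
    """Wald+2 estimate, SE and confidence interval for p_{*1} - p_{1*}."""
    # Adding 1/2 to each of the four cells adds 2 observations in total.
    n_adj = n + 2
    # The sample difference times n / (n + 2) equals (f01 - f10) / (n + 2).
    diff_adj = (f01 - f10) / n_adj
    # Each discordant cell gains 1/2, so their sum gains 1; their difference is unchanged.
    se_adj = math.sqrt(n_adj * (f10 + f01 + 1) - (f10 - f01) ** 2) / (
        n_adj * math.sqrt(n_adj)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff_adj, se_adj, (diff_adj - z * se_adj, diff_adj + z * se_adj)

# Hypothetical counts: f10=10, f01=20, n=100.
diff_adj, se_adj, (lower, upper) = wald_plus2_matched_pairs_ci(10, 20, 100)
print(diff_adj, se_adj, lower, upper)
```

Compared with the unadjusted Wald interval, the center is pulled slightly toward zero.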

Other modifications include ''Bonett and Price’s Adjusted Wald'', and ''Newcombe’s Score''.