=== Confidence intervals for the difference of two proportions === In the setting of a multinomial distribution, constructing confidence intervals for the difference between the proportions of observations from two events, <math>p_i-p_j</math>, requires incorporating the negative covariance between the sample estimators <math>\hat{p}_i = \frac{X_i}{n} </math> and <math>\hat{p}_j = \frac{X_j}{n}</math>. Much of the literature on the subject has focused on the use case of matched-pairs binary data, which requires careful attention when translating the formulas to the general case of <math>p_i-p_j</math> for any multinomial distribution. The formulas in the current section are general, while the formulas in the next section focus on the matched-pairs binary data use case. Wald's standard error (SE) of the difference of proportions can be estimated using:<ref>{{Cite book | last1 = Fleiss | first1 = Joseph L. | last2 = Levin | first2 = Bruce | last3 = Paik | first3 = Myunghee Cho | title = Statistical Methods for Rates and Proportions | edition = 3rd | publisher = J. Wiley | year = 2003 | isbn = 9780471526292 | location = Hoboken, N.J | pages = 760 }}</ref>{{rp|378}}<ref>{{Cite journal | last1 = Newcombe | first1 = R. G. 
| title = Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods | journal = Statistics in Medicine | year = 1998 | volume = 17 | issue = 8 | pages = 873–890 | doi = 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I | pmid = 9595617 }}</ref> <math> \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} = \sqrt{\frac{(\hat{p}_i + \hat{p}_j) - (\hat{p}_i - \hat{p}_j)^2}{n}} </math> For a <math>100(1 - \alpha)\%</math> [[Confidence interval#Approximate confidence intervals|approximate confidence interval]], the [[margin of error]] may incorporate the appropriate quantile from the [[standard normal distribution]], as follows: <math>(\hat{p}_i - \hat{p}_j) \pm z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}</math> {{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}} As the sample size (<math>n</math>) increases, the sample proportions will approximately follow a [[multivariate normal distribution]], thanks to the [[Central limit theorem#Multidimensional CLT|multidimensional central limit theorem]] (and it could also be shown using the [[Cramér–Wold theorem]]). Therefore, their difference will also be approximately normal. Also, these estimators are [[Consistent estimator|weakly consistent]] and plugging them into the SE estimator makes it also weakly consistent. Hence, thanks to [[Slutsky's theorem]], the [[pivotal quantity]] <math>\frac{(\hat{p}_i - \hat{p}_j) - (p_i - p_j)}{\widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}}</math> approximately follows the [[standard normal distribution]]. And from that, the above [[Confidence interval#Approximate confidence intervals|approximate confidence interval]] is directly derived. 
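The Wald interval above can be sketched in a few lines of code. This is an illustrative implementation of the formulas just given, not taken from any library; the function name and example counts are hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci_diff(x_i, x_j, n, alpha=0.05):
    """Wald CI for p_i - p_j from multinomial counts x_i, x_j out of n.

    Uses SE = sqrt(((p_i + p_j) - (p_i - p_j)^2) / n), which already
    accounts for the negative covariance -p_i * p_j / n of the estimators.
    """
    p_i, p_j = x_i / n, x_j / n
    diff = p_i - p_j
    se = math.sqrt(((p_i + p_j) - diff ** 2) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    return diff - z * se, diff + z * se

# Hypothetical example: counts 45 and 35 out of n = 100 observations.
lo, hi = wald_ci_diff(45, 35, 100)
```

With these counts the point estimate is 0.10 and the 95% interval is roughly (-0.074, 0.274); the interval is wide because the two proportions are negatively correlated but each carries substantial sampling noise at n = 100.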
The SE can be constructed using the formula for [[Variance#Addition and multiplication by a constant|the variance of the difference of two random variables]]: <math> \begin{align} \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} & = \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{n} + \frac{\hat{p}_j (1 - \hat{p}_j)}{n} - 2\left(-\frac{\hat{p}_i \hat{p}_j}{n}\right)} \\ & = \sqrt{\frac{1}{n} \left(\hat{p}_i + \hat{p}_j - \hat{p}_i^2 - \hat{p}_j^2 + 2\hat{p}_i \hat{p}_j\right)} \\ & = \sqrt{\frac{(\hat{p}_i + \hat{p}_j) - (\hat{p}_i - \hat{p}_j)^2}{n}} \end{align} </math> {{hidden end}} A modification which includes a [[continuity correction]] adds <math>\frac{1}{n}</math> to the margin of error as follows:<ref name=pass_sample_size_software>{{Cite web|url=https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Difference_Between_Two_Correlated_Proportions.pdf|title=Confidence Intervals for the Difference Between Two Correlated Proportions|publisher=NCSS|access-date=2022-03-22}}</ref>{{rp|102–3}} <math>(\hat{p}_i - \hat{p}_j) \pm \left(z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} + \frac{1}{n}\right)</math> An alternative is to rely on a Bayesian estimator using the [[Jeffreys prior]], which leads to using a [[Dirichlet distribution]], with all parameters equal to 0.5, as the prior. The posterior-based interval applies the calculations above after adding 1/2 to each of the ''k'' observed counts, leading to an overall increase of the sample size by <math>\frac{k}{2}</math>. 
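The continuity-corrected variant differs from the plain Wald interval only in that 1/n is added to the margin of error. A minimal sketch, with an illustrative function name and example counts:

```python
import math
from statistics import NormalDist

def wald_cc_ci_diff(x_i, x_j, n, alpha=0.05):
    """Continuity-corrected Wald CI for p_i - p_j.

    Identical to the plain Wald interval except that 1/n is added
    to the margin of error on each side.
    """
    p_i, p_j = x_i / n, x_j / n
    diff = p_i - p_j
    se = math.sqrt(((p_i + p_j) - diff ** 2) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * se + 1 / n  # continuity correction widens the interval
    return diff - margin, diff + margin

# Hypothetical example: counts 45 and 35 out of n = 100 observations.
lo, hi = wald_cc_ci_diff(45, 35, 100)
```

The correction simply widens the interval by 1/n on each side (here, by 0.01), which improves coverage for small n at the cost of some conservatism.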
This was originally developed for a multinomial distribution with four events, for analyzing matched-pairs data, and is known as ''wald+2'' (see the next section for more details).<ref name=Agresti2005>{{Cite journal | last1 = Agresti | first1 = Alan | last2 = Min | first2 = Yongyi | title = Simple improved confidence intervals for comparing matched proportions | journal = Statistics in Medicine | year = 2005 | volume = 24 | issue = 5 | pages = 729–740 | doi = 10.1002/sim.1781 | pmid = 15696504 | url = https://users.stat.ufl.edu/~aa/articles/agresti_min_2005b.pdf }}</ref> This leads to the following SE: <math> \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}} = \sqrt{\frac{\left(\hat{p}_i + \hat{p}_j + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\hat{p}_i - \hat{p}_j\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} </math> {{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}} <math> \begin{align} \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}} & = \sqrt{\frac{\left(\frac{x_i+1/2}{n+\frac{k}{2}} + \frac{x_j+1/2}{n+\frac{k}{2}}\right) - \left(\frac{x_i+1/2}{n+\frac{k}{2}} - \frac{x_j+1/2}{n+\frac{k}{2}}\right)^2}{n+\frac{k}{2}}} \\ & = \sqrt{\frac{\left(\frac{x_i}{n} + \frac{x_j}{n} + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\frac{x_i}{n} - \frac{x_j}{n}\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} \\ & = \sqrt{\frac{\left(\hat{p}_i + \hat{p}_j + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\hat{p}_i - \hat{p}_j\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} \end{align} </math> {{hidden end}} This SE can be plugged into the original Wald formula, together with the correspondingly shrunk point estimate, as follows: <math>(\hat{p}_i - \hat{p}_j)\frac{n}{n+\frac{k}{2}} \pm z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}}</math>
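As the proof shows, the ''wald+k/2'' interval is exactly the plain Wald interval computed on the adjusted counts <math>x_i + 1/2</math> out of <math>n + k/2</math>, so it is simplest to implement it that way. A minimal sketch, assuming hypothetical counts from a four-event (''k'' = 4) multinomial:

```python
import math
from statistics import NormalDist

def wald_plus_ci_diff(x_i, x_j, n, k, alpha=0.05):
    """'wald+k/2' CI for p_i - p_j.

    Adds 1/2 to each of the k multinomial counts (the Jeffreys /
    Dirichlet(1/2, ..., 1/2) prior) and applies the plain Wald formula
    to the adjusted proportions.  The point estimate (x_i - x_j)/(n + k/2)
    equals the shrunk estimate (p_i_hat - p_j_hat) * n / (n + k/2).
    """
    n_adj = n + k / 2
    p_i = (x_i + 0.5) / n_adj
    p_j = (x_j + 0.5) / n_adj
    diff = p_i - p_j
    se = math.sqrt(((p_i + p_j) - diff ** 2) / n_adj)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

# Hypothetical example: counts 45 and 35 out of n = 100, with k = 4 events.
lo, hi = wald_plus_ci_diff(45, 35, 100, k=4)
```

With k = 4 this adds 2 pseudo-observations in total (hence the name ''wald+2''), pulling the point estimate slightly toward zero and stabilizing the interval for small counts.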