== Statistical inference ==

{{Expand section|date=March 2024|with=A new sub-section about simultaneous confidence intervals (with proper citations, e.g.: [https://www.stat.cmu.edu/technometrics/59-69/VOL-07-02/v0702247.pdf]).}}

=== Equivalence tests for multinomial distributions ===

The goal of equivalence testing is to establish the agreement between a theoretical multinomial distribution and observed counting frequencies. The theoretical distribution may be a fully specified multinomial distribution or a parametric family of multinomial distributions.

Let <math>q</math> denote a theoretical multinomial distribution and let <math>p</math> be the true underlying distribution. The distributions <math>p</math> and <math>q</math> are considered equivalent if <math>d(p,q)<\varepsilon</math> for a distance <math>d</math> and a tolerance parameter <math>\varepsilon>0</math>. The equivalence test problem is <math>H_0=\{d(p,q)\geq\varepsilon\}</math> versus <math>H_1=\{d(p,q)<\varepsilon\}</math>. The true underlying distribution <math>p</math> is unknown; instead, the counting frequencies <math>p_n</math> are observed, where <math>n</math> is the sample size. An equivalence test uses <math>p_n</math> to reject <math>H_0</math>. If <math>H_0</math> can be rejected, then the equivalence between <math>p</math> and <math>q</math> is shown at the given significance level.

An equivalence test for the Euclidean distance can be found in the textbook by Wellek (2010).<ref>{{Cite book|title=Testing statistical hypotheses of equivalence and noninferiority|last=Wellek|first=Stefan|publisher=Chapman and Hall/CRC|year=2010|isbn=978-1439808184}}</ref> An equivalence test for the total variation distance was developed by Ostrovski (2017).<ref>{{cite journal|last1=Ostrovski|first1=Vladimir|date=May 2017|title=Testing equivalence of multinomial distributions|journal=Statistics & Probability Letters|volume=124|pages=77–82|doi=10.1016/j.spl.2017.01.004|s2cid=126293429}} [http://dx.doi.org/10.1016/j.spl.2017.01.004 Official web link (subscription required)]. [https://www.researchgate.net/publication/312481284_Testing_equivalence_of_multinomial_distributions Alternate, free web link].</ref> An exact equivalence test for a specific cumulative distance was proposed by Frey (2009).<ref>{{cite journal|last1=Frey|first1=Jesse|date=March 2009|title=An exact multinomial test for equivalence|journal=The Canadian Journal of Statistics|volume=37|pages=47–59|doi=10.1002/cjs.10000|s2cid=122486567}} [http://www.jstor.org/stable/25653460 Official web link (subscription required)].</ref>

The distance between the true underlying distribution <math>p</math> and a family of multinomial distributions <math>\mathcal{M}</math> is defined by <math>d(p, \mathcal{M})=\min_{h\in\mathcal{M}}d(p,h)</math>. The equivalence test problem is then given by <math>H_0=\{d(p,\mathcal{M})\geq \varepsilon\}</math> versus <math>H_1=\{d(p,\mathcal{M})< \varepsilon\}</math>. The distance <math>d(p,\mathcal{M})</math> is usually computed using numerical optimization. Tests for this case were developed by Ostrovski (2018).<ref>{{cite journal|last1=Ostrovski|first1=Vladimir|date=March 2018|title=Testing equivalence to families of multinomial distributions with application to the independence model|journal=Statistics & Probability Letters|volume=139|pages=61–66|doi=10.1016/j.spl.2018.03.014|s2cid=126261081}} [https://doi.org/10.1016/j.spl.2018.03.014 Official web link (subscription required)]. [https://www.researchgate.net/publication/324124605_Testing_equivalence_to_families_of_multinomial_distributions_with_application_to_the_independence_model Alternate, free web link].</ref>
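The optimization step can be illustrated with a short Python sketch for the independence family of a 2×2 table, using the total variation distance. This is only a schematic example: the choice of distance, the function names, and the use of SciPy's <code>minimize</code> routine are assumptions for illustration, not code from the cited papers.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))

# Observed counting frequencies p_n from a 2x2 contingency table.
counts = np.array([[30.0, 20.0], [25.0, 25.0]])
p_n = counts / counts.sum()

def objective(theta):
    """Distance from p_n to the independent (product) distribution
    with row marginal (a, 1 - a) and column marginal (b, 1 - b)."""
    a, b = theta
    h = np.outer([a, 1.0 - a], [b, 1.0 - b])
    return total_variation(p_n.ravel(), h.ravel())

# d(p_n, M): minimize the distance over the independence family M.
res = minimize(objective, x0=[0.5, 0.5], bounds=[(0.0, 1.0)] * 2)
print(res.fun)  # compare this distance against the tolerance epsilon
</syntaxhighlight>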
=== Confidence intervals for the difference of two proportions ===

In the setting of a multinomial distribution, constructing confidence intervals for the difference between the proportions of observations from two events, <math>p_i-p_j</math>, requires the incorporation of the negative covariance between the sample estimators <math>\hat{p}_i = \frac{X_i}{n}</math> and <math>\hat{p}_j = \frac{X_j}{n}</math>.

Some of the literature on the subject focuses on the use-case of matched-pairs binary data, which requires careful attention when translating the formulas to the general case of <math>p_i-p_j</math> for any multinomial distribution. The formulas in this section are general, while the formulas in the next section focus on the matched-pairs binary data use-case.

Wald's standard error (SE) of the difference of proportions can be estimated using:<ref>{{Cite book | last1 = Fleiss | first1 = Joseph L. | last2 = Levin | first2 = Bruce | last3 = Paik | first3 = Myunghee Cho | title = Statistical Methods for Rates and Proportions | edition = 3rd | publisher = J. Wiley | year = 2003 | isbn = 9780471526292 | location = Hoboken, N.J | pages = 760 }}</ref>{{rp|378}}<ref>{{Cite journal | last1 = Newcombe | first1 = R. G. | title = Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods | journal = Statistics in Medicine | year = 1998 | volume = 17 | issue = 8 | pages = 873–890 | doi = 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I | pmid = 9595617 }}</ref>

<math> \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} = \sqrt{\frac{(\hat{p}_i + \hat{p}_j) - (\hat{p}_i - \hat{p}_j)^2}{n}} </math>

For a <math>100(1 - \alpha)\%</math> [[Confidence interval#Approximate confidence intervals|approximate confidence interval]], the [[margin of error]] may incorporate the appropriate quantile from the [[standard normal distribution]], as follows:

<math>(\hat{p}_i - \hat{p}_j) \pm z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}</math>

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
As the sample size (<math>n</math>) increases, the sample proportions approximately follow a [[multivariate normal distribution]], by the [[Central limit theorem#Multidimensional CLT|multidimensional central limit theorem]] (this can also be shown using the [[Cramér–Wold theorem]]); therefore, their difference is also approximately normal. These estimators are [[Consistent estimator|weakly consistent]], and plugging them into the SE estimator makes it weakly consistent as well. Hence, by [[Slutsky's theorem]], the [[pivotal quantity]] <math>\frac{(\hat{p}_i - \hat{p}_j) - (p_i - p_j)}{\widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}}</math> approximately follows the [[standard normal distribution]], from which the above [[Confidence interval#Approximate confidence intervals|approximate confidence interval]] is directly derived.

The SE itself can be constructed using the calculus of [[Variance#Addition and multiplication by a constant|the variance of the difference of two random variables]]:

<math> \begin{align} \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} & = \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{n} + \frac{\hat{p}_j (1 - \hat{p}_j)}{n} - 2\left(-\frac{\hat{p}_i \hat{p}_j}{n}\right)} \\ & = \sqrt{\frac{1}{n} \left(\hat{p}_i + \hat{p}_j - \hat{p}_i^2 - \hat{p}_j^2 + 2\hat{p}_i \hat{p}_j\right)} \\ & = \sqrt{\frac{(\hat{p}_i + \hat{p}_j) - (\hat{p}_i - \hat{p}_j)^2}{n}} \end{align} </math>
{{hidden end}}
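A minimal Python sketch of this interval, assuming counts <math>x_i, x_j</math> out of <math>n</math> trials (the function name is hypothetical; the sketch simply evaluates the Wald formula above, with the normal quantile taken from SciPy):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def wald_ci_diff(x_i, x_j, n, alpha=0.05):
    """Wald interval for p_i - p_j in a multinomial sample; the SE
    formula already absorbs the negative covariance of the two
    sample proportions."""
    p_i, p_j = x_i / n, x_j / n
    diff = p_i - p_j
    se = np.sqrt(((p_i + p_j) - diff ** 2) / n)
    margin = norm.ppf(1 - alpha / 2) * se
    return diff - margin, diff + margin

# Example: counts (45, 30, 25) out of n = 100; 95% CI for p_1 - p_2.
print(wald_ci_diff(45, 30, 100))  # approximately (-0.017, 0.317)
</syntaxhighlight>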
A modification which includes a [[continuity correction]] adds <math>\frac{1}{n}</math> to the margin of error, as follows:<ref name=pass_sample_size_software>{{Cite web|url=https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Difference_Between_Two_Correlated_Proportions.pdf|title=Confidence Intervals for the Difference Between Two Correlated Proportions|publisher=NCSS|access-date=2022-03-22}}</ref>{{rp|102–3}}

<math>(\hat{p}_i - \hat{p}_j) \pm \left(z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)} + \frac{1}{n}\right)</math>

Another alternative is to rely on a Bayesian estimator using the [[Jeffreys prior]], which amounts to using a [[Dirichlet distribution]], with all parameters equal to 0.5, as a prior. The posterior yields the calculations above after adding 1/2 to each of the ''k'' counts, which increases the overall sample size by <math>\frac{k}{2}</math>. Originally developed for a multinomial distribution with four events for analyzing matched-pairs data, this method is known as ''wald+2'' (see the next section for more details).<ref name=Agresti2005>{{Cite journal | last1 = Agresti | first1 = Alan | last2 = Min | first2 = Yongyi | title = Simple improved confidence intervals for comparing matched proportions | journal = Statistics in Medicine | year = 2005 | volume = 24 | issue = 5 | pages = 729–740 | doi = 10.1002/sim.1781 | pmid = 15696504 | url = https://users.stat.ufl.edu/~aa/articles/agresti_min_2005b.pdf }}</ref> It leads to the following SE:

<math> \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}} = \sqrt{\frac{\left(\hat{p}_i + \hat{p}_j + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\hat{p}_i - \hat{p}_j\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} </math>

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
<math> \begin{align} \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}} & = \sqrt{\frac{\left(\frac{x_i+1/2}{n+\frac{k}{2}} + \frac{x_j+1/2}{n+\frac{k}{2}}\right) - \left(\frac{x_i+1/2}{n+\frac{k}{2}} - \frac{x_j+1/2}{n+\frac{k}{2}}\right)^2}{n+\frac{k}{2}}} \\ & = \sqrt{\frac{\left(\frac{x_i}{n} + \frac{x_j}{n} + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\frac{x_i}{n} - \frac{x_j}{n}\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} \\ & = \sqrt{\frac{\left(\hat{p}_i + \hat{p}_j + \frac{1}{n}\right)\frac{n}{n+\frac{k}{2}} - \left(\hat{p}_i - \hat{p}_j\right)^2 \left(\frac{n}{n+\frac{k}{2}}\right)^2 }{n+\frac{k}{2}}} \end{align} </math>
{{hidden end}}

This SE can be plugged into the Wald formula, together with the correspondingly shrunken point estimate, as follows:

<math>(\hat{p}_i - \hat{p}_j)\frac{n}{n+\frac{k}{2}} \pm z_{\alpha/2} \cdot \widehat{\operatorname{SE}(\hat{p}_i - \hat{p}_j)}_{wald+\frac{k}{2}}</math>
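Both adjustments can be sketched in the same illustrative way (again, the function names are hypothetical and not from the cited sources; the ''wald+k/2'' version simply adds 1/2 to each count and replaces <math>n</math> with <math>n + \frac{k}{2}</math> before applying the Wald formula):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def wald_cc_ci(x_i, x_j, n, alpha=0.05):
    """Continuity-corrected Wald interval: widens the margin by 1/n."""
    p_i, p_j = x_i / n, x_j / n
    diff = p_i - p_j
    se = np.sqrt(((p_i + p_j) - diff ** 2) / n)
    margin = norm.ppf(1 - alpha / 2) * se + 1 / n
    return diff - margin, diff + margin

def wald_plus_ci(x_i, x_j, n, k, alpha=0.05):
    """'wald+k/2' interval: add 1/2 to each of the k counts
    (Jeffreys / Dirichlet(1/2, ..., 1/2) prior), i.e. use the
    posterior-mean proportions with sample size n + k/2."""
    n_adj = n + k / 2
    p_i = (x_i + 0.5) / n_adj  # posterior-mean proportion
    p_j = (x_j + 0.5) / n_adj
    diff = p_i - p_j           # equals (p_i_hat - p_j_hat) * n / (n + k/2)
    se = np.sqrt(((p_i + p_j) - diff ** 2) / n_adj)
    margin = norm.ppf(1 - alpha / 2) * se
    return diff - margin, diff + margin

# Same counts as before, with k = 3 categories:
print(wald_cc_ci(45, 30, 100))         # approximately (-0.027, 0.327)
print(wald_plus_ci(45, 30, 100, k=3))  # slightly shrunk toward zero
</syntaxhighlight>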