=== Large deviation theory ===
{{See also|Sanov's theorem}}

==== Asymptotics ====
By [[Stirling's formula]], in the limit <math>n, x_1, \ldots, x_k \to \infty</math>, we have<math display="block">\ln \binom{n}{x_1, \cdots, x_k} + \sum_{i=1}^k x_i\ln p_i = -n D_{KL}(\hat p \| p) - \frac{k-1}{2} \ln(2\pi n) - \frac 12 \sum_{i=1}^k \ln(\hat p_i) + o(1)</math>where the relative frequencies <math>\hat p_i = x_i/n</math> in the data can be interpreted as probabilities from the empirical distribution <math>\hat p</math>, and <math>D_{KL}</math> is the [[Kullback–Leibler divergence]].

This formula can be interpreted as follows. Consider <math>\Delta_k</math>, the space of all possible distributions over the categories <math>\{1, 2, \ldots, k\}</math>. It is a [[simplex]]. After <math>n</math> independent samples from the categorical distribution <math>p</math> (which is how we construct the multinomial distribution), we obtain an empirical distribution <math>\hat p</math>. By the asymptotic formula, the probability that the empirical distribution <math>\hat p</math> deviates from the actual distribution <math>p</math> decays exponentially, at rate <math>n D_{KL}(\hat p \| p)</math>. The more experiments there are, and the more <math>\hat p</math> differs from <math>p</math>, the less likely it is to see such an empirical distribution.

If <math>A</math> is a closed subset of <math>\Delta_k</math>, then by dividing <math>A</math> into pieces and reasoning about the growth rate of <math>\Pr(\hat p \in A_\epsilon)</math> on each piece <math>A_\epsilon</math>, we obtain [[Sanov's theorem]], which states that<math display="block">\lim_{n \to \infty} \frac 1n \ln \Pr(\hat p \in A) = - \inf_{\hat p \in A} D_{KL}(\hat p \| p)</math>

==== Concentration at large ''n'' ====
Due to the exponential decay, at large <math>n</math> almost all the probability mass is concentrated in a small neighborhood of <math>p</math>. In this small neighborhood, we can take the first nonzero term in the Taylor expansion of <math>D_{KL}</math> to obtain<math display="block">\ln \binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx -\frac n2 \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = -\frac 12 \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math>This resembles the Gaussian distribution, which suggests the following theorem:

'''Theorem.''' At the <math>n \to \infty</math> limit, <math>n \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math> [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1)</math>.

[[File:Convergence of multinomial distribution to the gaussian distribution.webm|thumb|339x339px|If we sample from the multinomial distribution <math>\mathrm{Multinomial}(n; 0.2, 0.3, 0.5)</math> and plot the heatmap of the samples within the 2-dimensional simplex (here shown as a black triangle), we notice that as <math>n \to \infty</math> the distribution converges to a Gaussian around the point <math>(0.2, 0.3, 0.5)</math>, with the contours converging in shape to ellipses whose radii shrink as <math>1/\sqrt n</math>. Meanwhile, the separation between the discrete points shrinks as <math>1/n</math>, and so the discrete multinomial distribution converges to a continuous Gaussian distribution.]]

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
The space of all distributions over the categories <math>\{1, 2, \ldots, k\}</math> is a [[simplex]]: <math>\Delta_{k} = \left\{(y_1, \ldots, y_k)\colon y_1, \ldots, y_k \geq 0, \sum_i y_i = 1\right\}</math>, and the set of all possible empirical distributions after <math>n</math> experiments is a subset of the simplex: <math>\Delta_{k, n} = \left\{(x_1/n, \ldots, x_k/n)\colon x_1, \ldots, x_k \in \N, \sum_i x_i = n\right\}</math>. That is, <math>\Delta_{k, n}</math> is the intersection of <math>\Delta_k</math> and the lattice <math>(\Z^k)/n</math>.

As <math>n</math> increases, most of the probability mass is concentrated in a subset of <math>\Delta_{k, n}</math> near <math>p</math>, and the probability distribution near <math>p</math> becomes well-approximated by<math display="block">\binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>From this, we see that the subset upon which the mass is concentrated has radius on the order of <math>1/\sqrt n</math>, while the points in the subset are separated by distance on the order of <math>1/n</math>, so at large <math>n</math> the points merge into a continuum. To convert this from a discrete probability distribution to a continuous probability density, we need to divide by the volume occupied by each point of <math>\Delta_{k, n}</math> in <math>\Delta_k</math>. However, by symmetry, every point occupies exactly the same volume (except for a negligible set on the boundary), so we obtain a probability density <math>\rho(\hat p) = C e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>, where <math>C</math> is a constant.

Finally, since the simplex <math>\Delta_k</math> is not all of <math>\R^k</math>, but lies within a <math>(k-1)</math>-dimensional affine plane, we obtain the desired result.
{{hidden end}}
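This convergence can be checked numerically. The following Python sketch is only illustrative: it assumes NumPy and SciPy are available, and the choices <math>n = 10000</math>, <math>p = (0.2, 0.3, 0.5)</math> and the number of samples are arbitrary. It draws multinomial samples, computes <math>\sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math> for each, and compares the empirical distribution of this statistic against <math>\chi^2(k-1)</math> with a Kolmogorov–Smirnov test (a large p-value is consistent with, though does not prove, the convergence).

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k = len(p)

# Draw many multinomial samples and compute, for each sample,
# the statistic sum_i (x_i - n p_i)^2 / (n p_i).
counts = rng.multinomial(n, p, size=100_000)
statistic = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)

# Compare the empirical distribution of the statistic with chi^2(k - 1).
print(stats.kstest(statistic, stats.chi2(df=k - 1).cdf))
</syntaxhighlight>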
==== Conditional concentration at large ''n'' ====
The above concentration phenomenon can be easily generalized to the case where we condition upon linear constraints. This is the theoretical justification for [[Pearson's chi-squared test]].

'''Theorem.''' Given frequencies <math>x_i\in\mathbb N</math> observed in a dataset with <math>n</math> points, we impose <math>\ell + 1</math> [[Linear independence|independent linear]] constraints<math display="block">\begin{cases} \sum_i \hat p_i = 1, \\ \sum_i a_{1i} \hat p_i = b_1, \\ \sum_i a_{2i} \hat p_i = b_2, \\ \vdots \\ \sum_i a_{\ell i} \hat p_i = b_{\ell} \end{cases}</math>(notice that the first constraint is simply the requirement that the empirical distribution sums to one), such that the empirical frequencies <math>\hat p_i=x_i/n</math> satisfy all of these constraints simultaneously. Let <math>q</math> denote the <math>I</math>-projection of the prior distribution <math>p</math> on the sub-region of the simplex allowed by the linear constraints. At the <math>n \to \infty</math> limit, sampled counts <math>n \hat p_i</math> from the multinomial distribution '''conditional on''' the linear constraints are governed by <math>2n D_{KL}(\hat p \| q) \approx n \sum_i \frac{(\hat p_i - q_i)^2}{q_i}</math>, which [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1-\ell)</math>.

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
An analogous proof applies in this Diophantine problem of coupled linear equations in the count variables <math>n \hat p_i</math>,<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=2023|title=Total Empiricism: Learning from Data|eprint=2311.08315|class=math.ST}}</ref> but this time <math>\Delta_{k, n}</math> is the intersection of <math>(\Z^k)/n</math> with <math>\Delta_k</math> and <math>\ell</math> hyperplanes, all linearly independent, so the probability density <math>\rho(\hat p)</math> is restricted to a <math>(k-\ell-1)</math>-dimensional plane. In particular, expanding the KL divergence <math>D_{KL}(\hat p \| p)</math> around its minimum <math>q</math> (the <math>I</math>-projection of <math>p</math> on <math>\Delta_{k, n}</math>) in the constrained problem ensures, by the Pythagorean theorem for <math>I</math>-divergence, that any constant and linear term in the counts <math>n \hat p_i</math> vanishes from the conditional probability of multinomially sampling those counts.

Notice that by definition, every one of <math>\hat p_1, \hat p_2, \ldots, \hat p_k</math> must be a rational number, whereas <math>p_1, p_2, \ldots, p_k</math> may be chosen from any real numbers in <math>[0, 1]</math> and need not satisfy the Diophantine system of equations. Only asymptotically, as <math>n\rightarrow\infty</math>, can the <math>\hat p_i</math> be regarded as probabilities over <math>[0, 1]</math>.
{{hidden end}}
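As a concrete illustration (a sketch, not taken from the cited references; NumPy and SciPy are assumed, and the particular constraint is a hypothetical choice), take <math>k = 3</math> and a single extra constraint <math>\hat p_1 = b</math>, so <math>\ell = 1</math>. Conditioning on <math>x_1 = nb</math> leaves the remaining counts multinomial with renormalized probabilities, so the conditional distribution can be sampled exactly, and the <math>I</math>-projection <math>q</math> keeps <math>q_1 = b</math> while rescaling the other entries of <math>p</math>. The statistic <math>n \sum_i \frac{(\hat p_i - q_i)^2}{q_i}</math> should then be approximately <math>\chi^2(k-1-\ell) = \chi^2(1)</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k, ell = len(p), 1

# Linear constraint: fix the first empirical frequency, x_1 / n = b.
b = 0.25
m = int(n * b)  # the conditioned count of category 1

# I-projection q of p onto {p_hat_1 = b}: q_1 = b, and the remaining
# entries of p are rescaled to sum to 1 - b.
q = np.concatenate([[b], p[1:] * (1 - b) / (1 - p[0])])

# Conditional on x_1 = m, the other counts are multinomial over the
# remaining categories with renormalized probabilities.
rest = rng.multinomial(n - m, p[1:] / (1 - p[0]), size=100_000)
counts = np.column_stack([np.full(len(rest), m), rest])

statistic = ((counts - n * q) ** 2 / (n * q)).sum(axis=1)
print(stats.kstest(statistic, stats.chi2(df=k - 1 - ell).cdf))
</syntaxhighlight>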
Away from the empirically observed constraints <math>b_1,\ldots,b_\ell</math> (such as moments or prevalences), the theorem can be generalized:

'''Theorem.'''
* Given functions <math>f_1, \ldots, f_\ell</math> that are continuously differentiable in a neighborhood of <math>p</math>, such that the vectors <math>(1, 1, \ldots, 1), \nabla f_1(p), \ldots, \nabla f_\ell(p)</math> are linearly independent;
* given sequences <math>\epsilon_1(n), \ldots, \epsilon_\ell(n)</math> such that asymptotically <math>\frac 1n \ll \epsilon_i(n) \ll \frac{1}{\sqrt n}</math> for each <math>i \in \{1, \ldots, \ell\}</math>;
* then for the multinomial distribution '''conditional on''' the constraints <math>f_1(\hat p) \in [f_1(p)- \epsilon_1(n), f_1(p) + \epsilon_1(n)], \ldots, f_\ell(\hat p) \in [f_\ell(p)- \epsilon_\ell(n), f_\ell(p) + \epsilon_\ell(n)]</math>, the quantity <math>n \sum_i \frac{(\hat p_i - p_i)^2}{p_i} = \sum_i \frac{(x_i - n p_i)^2}{n p_i}</math> converges in distribution to <math>\chi^2(k-1-\ell)</math> at the <math>n \to \infty</math> limit; a numerical sketch is given below.

In the case that all <math>p_i</math> are equal, the theorem reduces to the concentration of entropies around the maximum entropy.<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=April 2022|title=Categorical Distributions of Maximum Entropy under Marginal Constraints|class=hep-th|eprint=2204.03406}}</ref><ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=June 2022|title=Entropy-based Characterization of Modeling Constraints|class=stat.ME|eprint=2206.14105}}</ref>
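The shrinking-window conditioning can be probed numerically by rejection sampling. The sketch below is again only illustrative and assumes NumPy and SciPy; the constraint function <math>f_1(\hat p) = \hat p_1</math> and the window <math>\epsilon(n) = n^{-3/4}</math> (which satisfies <math>\tfrac 1n \ll \epsilon(n) \ll \tfrac{1}{\sqrt n}</math>) are hypothetical choices, and at finite <math>n</math> the agreement with <math>\chi^2(k-1-\ell)</math> is only approximate.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k, ell = len(p), 1

# Constraint f_1(p_hat) = p_hat_1, enforced within a shrinking window
# eps(n) = n**(-3/4), so that 1/n << eps(n) << 1/sqrt(n).
eps = n ** -0.75

# Rejection sampling: keep only samples whose first empirical frequency
# falls within the window around p_1.
counts = rng.multinomial(n, p, size=500_000)
phat = counts / n
accepted = counts[np.abs(phat[:, 0] - p[0]) <= eps]

statistic = ((accepted - n * p) ** 2 / (n * p)).sum(axis=1)
print(len(accepted), stats.kstest(statistic, stats.chi2(df=k - 1 - ell).cdf))
</syntaxhighlight>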