=== Large deviation theory ===
{{See also|Sanov's theorem}}

==== Asymptotics ====
By [[Stirling's formula]], in the limit <math>n, x_1, \ldots, x_k \to \infty</math>, we have<math display="block">\ln \binom{n}{x_1, \cdots, x_k} + \sum_{i=1}^k x_i\ln p_i = -n D_{KL}(\hat p \| p) - \frac{k-1}{2} \ln(2\pi n) - \frac 12 \sum_{i=1}^k \ln(\hat p_i) + o(1)</math>where the relative frequencies <math>\hat p_i = x_i/n</math> in the data can be interpreted as probabilities from the empirical distribution <math>\hat p</math>, and <math>D_{KL}</math> is the [[Kullback–Leibler divergence]].

This formula can be interpreted as follows. Consider <math>\Delta_k</math>, the space of all possible distributions over the categories <math>\{1, 2, \ldots, k\}</math>. It is a [[simplex]]. After <math>n</math> independent samples from the categorical distribution <math>p</math> (which is how we construct the multinomial distribution), we obtain an empirical distribution <math>\hat p</math>. By the asymptotic formula, the probability that the empirical distribution <math>\hat p</math> deviates from the actual distribution <math>p</math> decays exponentially, at rate <math>n D_{KL}(\hat p \| p)</math>. The more experiments there are, and the more <math>\hat p</math> differs from <math>p</math>, the less likely it is to see such an empirical distribution.

If <math>A</math> is a closed subset of <math>\Delta_k</math>, then by dividing <math>A</math> into pieces and reasoning about the growth rate of <math>\Pr(\hat p \in A_\epsilon)</math> on each piece <math>A_\epsilon</math>, we obtain [[Sanov's theorem]], which states that<math display="block">\lim_{n \to \infty} \frac 1n \ln \Pr(\hat p \in A) = - \inf_{\hat p \in A} D_{KL}(\hat p \| p)</math>

==== Concentration at large ''n'' ====
Due to the exponential decay, at large <math>n</math> almost all the probability mass is concentrated in a small neighborhood of <math>p</math>. In this small neighborhood, we can take the first nonzero term in the Taylor expansion of <math>D_{KL}</math> to obtain<math display="block">\ln \binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx -\frac n2 \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = -\frac 12 \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math>This resembles the Gaussian distribution, which suggests the following theorem:

'''Theorem.''' At the <math>n \to \infty</math> limit, <math>n \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math> [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1)</math>.

[[File:Convergence of multinomial distribution to the gaussian distribution.webm|thumb|339x339px|If we sample from the multinomial distribution <math>\mathrm{Multinomial}(n; 0.2, 0.3, 0.5)</math> and plot the heatmap of the samples within the 2-dimensional simplex (here shown as a black triangle), we notice that as <math>n \to \infty</math> the distribution converges to a Gaussian around the point <math>(0.2, 0.3, 0.5)</math>, with the contours converging in shape to ellipses whose radii shrink as <math>1/\sqrt n</math>. Meanwhile, the separation between the discrete points shrinks as <math>1/n</math>, and so the discrete multinomial distribution converges to a continuous Gaussian distribution.]]

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
The space of all distributions over the categories <math>\{1, 2, \ldots, k\}</math> is a [[simplex]]: <math>\Delta_{k} = \left\{(y_1, \ldots, y_k)\colon y_1, \ldots, y_k \geq 0, \sum_i y_i = 1\right\}</math>, and the set of all possible empirical distributions after <math>n</math> experiments is a subset of the simplex: <math>\Delta_{k, n} = \left\{(x_1/n, \ldots, x_k/n)\colon x_1, \ldots, x_k \in \N, \sum_i x_i = n\right\}</math>. That is, <math>\Delta_{k, n}</math> is the intersection of <math>\Delta_k</math> and the lattice <math>(\Z^k)/n</math>.

As <math>n</math> increases, most of the probability mass is concentrated in a subset of <math>\Delta_{k, n}</math> near <math>p</math>, and the probability distribution near <math>p</math> becomes well-approximated by<math display="block">\binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>From this, we see that the subset upon which the mass is concentrated has radius on the order of <math>1/\sqrt n</math>, while the points in the subset are separated by distance on the order of <math>1/n</math>, so at large <math>n</math> the points merge into a continuum. To convert this from a discrete probability distribution to a continuous probability density, we need to divide by the volume occupied by each point of <math>\Delta_{k, n}</math> in <math>\Delta_k</math>. However, by symmetry, every point occupies exactly the same volume (except for a negligible set on the boundary), so we obtain a probability density <math>\rho(\hat p) = C e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>, where <math>C</math> is a constant.

Finally, since the simplex <math>\Delta_k</math> is not all of <math>\R^k</math>, but lies within a <math>(k-1)</math>-dimensional affine plane, we obtain the desired result.
{{hidden end}}
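This convergence can be checked numerically. The following Python sketch is only illustrative: it assumes NumPy and SciPy are available, and the choices <math>n = 10000</math>, <math>p = (0.2, 0.3, 0.5)</math> and the number of samples are arbitrary. It draws multinomial samples, computes <math>\sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math> for each, and compares the empirical distribution of this statistic against <math>\chi^2(k-1)</math> with a Kolmogorov–Smirnov test (a large p-value is consistent with, though does not prove, the convergence).

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k = len(p)

# Draw many multinomial samples and compute, for each sample,
# the statistic sum_i (x_i - n p_i)^2 / (n p_i).
counts = rng.multinomial(n, p, size=100_000)
statistic = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)

# Compare the empirical distribution of the statistic with chi^2(k - 1).
print(stats.kstest(statistic, stats.chi2(df=k - 1).cdf))
</syntaxhighlight>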
==== Conditional concentration at large ''n'' ====
The above concentration phenomenon can be easily generalized to the case where we condition upon linear constraints. This is the theoretical justification for [[Pearson's chi-squared test]].

'''Theorem.''' Given frequencies <math>x_i\in\mathbb N</math> observed in a dataset with <math>n</math> points, we impose <math>\ell + 1</math> [[Linear independence|independent linear]] constraints<math display="block">\begin{cases} \sum_i \hat p_i = 1, \\ \sum_i a_{1i} \hat p_i = b_1, \\ \sum_i a_{2i} \hat p_i = b_2, \\ \vdots \\ \sum_i a_{\ell i} \hat p_i = b_{\ell} \end{cases}</math>(notice that the first constraint is simply the requirement that the empirical distribution sums to one), such that the empirical frequencies <math>\hat p_i=x_i/n</math> satisfy all of these constraints simultaneously. Let <math>q</math> denote the <math>I</math>-projection of the prior distribution <math>p</math> on the sub-region of the simplex allowed by the linear constraints. At the <math>n \to \infty</math> limit, sampled counts <math>n \hat p_i</math> from the multinomial distribution '''conditional on''' the linear constraints are governed by <math>2n D_{KL}(\hat p \| q) \approx n \sum_i \frac{(\hat p_i - q_i)^2}{q_i}</math>, which [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1-\ell)</math>.

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
An analogous proof applies in this Diophantine problem of coupled linear equations in the count variables <math>n \hat p_i</math>,<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=2023|title=Total Empiricism: Learning from Data|eprint=2311.08315|class=math.ST}}</ref> but this time <math>\Delta_{k, n}</math> is the intersection of <math>(\Z^k)/n</math> with <math>\Delta_k</math> and <math>\ell</math> hyperplanes, all linearly independent, so the probability density <math>\rho(\hat p)</math> is restricted to a <math>(k-\ell-1)</math>-dimensional plane. In particular, expanding the KL divergence <math>D_{KL}(\hat p \| p)</math> around its minimum <math>q</math> (the <math>I</math>-projection of <math>p</math> on <math>\Delta_{k, n}</math>) in the constrained problem ensures, by the Pythagorean theorem for <math>I</math>-divergence, that any constant and linear term in the counts <math>n \hat p_i</math> vanishes from the conditional probability of multinomially sampling those counts.

Notice that by definition, every one of <math>\hat p_1, \hat p_2, \ldots, \hat p_k</math> must be a rational number, whereas <math>p_1, p_2, \ldots, p_k</math> may be chosen from any real numbers in <math>[0, 1]</math> and need not satisfy the Diophantine system of equations. Only asymptotically, as <math>n\rightarrow\infty</math>, can the <math>\hat p_i</math> be regarded as probabilities over <math>[0, 1]</math>.
{{hidden end}}
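As a concrete illustration (a sketch, not taken from the cited references; NumPy and SciPy are assumed, and the particular constraint is a hypothetical choice), take <math>k = 3</math> and a single extra constraint <math>\hat p_1 = b</math>, so <math>\ell = 1</math>. Conditioning on <math>x_1 = nb</math> leaves the remaining counts multinomial with renormalized probabilities, so the conditional distribution can be sampled exactly, and the <math>I</math>-projection <math>q</math> keeps <math>q_1 = b</math> while rescaling the other entries of <math>p</math>. The statistic <math>n \sum_i \frac{(\hat p_i - q_i)^2}{q_i}</math> should then be approximately <math>\chi^2(k-1-\ell) = \chi^2(1)</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k, ell = len(p), 1

# Linear constraint: fix the first empirical frequency, x_1 / n = b.
b = 0.25
m = int(n * b)  # the conditioned count of category 1

# I-projection q of p onto {p_hat_1 = b}: q_1 = b, and the remaining
# entries of p are rescaled to sum to 1 - b.
q = np.concatenate([[b], p[1:] * (1 - b) / (1 - p[0])])

# Conditional on x_1 = m, the other counts are multinomial over the
# remaining categories with renormalized probabilities.
rest = rng.multinomial(n - m, p[1:] / (1 - p[0]), size=100_000)
counts = np.column_stack([np.full(len(rest), m), rest])

statistic = ((counts - n * q) ** 2 / (n * q)).sum(axis=1)
print(stats.kstest(statistic, stats.chi2(df=k - 1 - ell).cdf))
</syntaxhighlight>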
Away from the empirically observed constraints <math>b_1,\ldots,b_\ell</math> (such as moments or prevalences), the theorem can be generalized:

'''Theorem.'''
* Given functions <math>f_1, \ldots, f_\ell</math> that are continuously differentiable in a neighborhood of <math>p</math>, such that the vectors <math>(1, 1, \ldots, 1), \nabla f_1(p), \ldots, \nabla f_\ell(p)</math> are linearly independent;
* given sequences <math>\epsilon_1(n), \ldots, \epsilon_\ell(n)</math> such that asymptotically <math>\frac 1n \ll \epsilon_i(n) \ll \frac{1}{\sqrt n}</math> for each <math>i \in \{1, \ldots, \ell\}</math>;
* then for the multinomial distribution '''conditional on''' the constraints <math>f_1(\hat p) \in [f_1(p)- \epsilon_1(n), f_1(p) + \epsilon_1(n)], \ldots, f_\ell(\hat p) \in [f_\ell(p)- \epsilon_\ell(n), f_\ell(p) + \epsilon_\ell(n)]</math>, the quantity <math>n \sum_i \frac{(\hat p_i - p_i)^2}{p_i} = \sum_i \frac{(x_i - n p_i)^2}{n p_i}</math> converges in distribution to <math>\chi^2(k-1-\ell)</math> at the <math>n \to \infty</math> limit; a numerical sketch is given below.

In the case that all <math>p_i</math> are equal, the theorem reduces to the concentration of entropies around the maximum entropy.<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=April 2022|title=Categorical Distributions of Maximum Entropy under Marginal Constraints|class=hep-th|eprint=2204.03406}}</ref><ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=June 2022|title=Entropy-based Characterization of Modeling Constraints|class=stat.ME|eprint=2206.14105}}</ref>
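The shrinking-window conditioning can be probed numerically by rejection sampling. The sketch below is again only illustrative and assumes NumPy and SciPy; the constraint function <math>f_1(\hat p) = \hat p_1</math> and the window <math>\epsilon(n) = n^{-3/4}</math> (which satisfies <math>\tfrac 1n \ll \epsilon(n) \ll \tfrac{1}{\sqrt n}</math>) are hypothetical choices, and at finite <math>n</math> the agreement with <math>\chi^2(k-1-\ell)</math> is only approximate.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10_000, np.array([0.2, 0.3, 0.5])  # illustrative choices
k, ell = len(p), 1

# Constraint f_1(p_hat) = p_hat_1, enforced within a shrinking window
# eps(n) = n**(-3/4), so that 1/n << eps(n) << 1/sqrt(n).
eps = n ** -0.75

# Rejection sampling: keep only samples whose first empirical frequency
# falls within the window around p_1.
counts = rng.multinomial(n, p, size=500_000)
phat = counts / n
accepted = counts[np.abs(phat[:, 0] - p[0]) <= eps]

statistic = ((accepted - n * p) ** 2 / (n * p)).sum(axis=1)
print(len(accepted), stats.kstest(statistic, stats.chi2(df=k - 1 - ell).cdf))
</syntaxhighlight>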