
==== Concentration at large ''n'' ====

Because of the exponential decay, at large <math>n</math> almost all of the probability mass is concentrated in a small neighborhood of <math>p</math>. In this neighborhood, we can keep only the first nonzero term in the Taylor expansion of <math>D_{KL}</math>, which gives<math display="block">\ln \binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx -\frac n2 \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = -\frac 12 \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math>The right-hand side has the form of the exponent of a Gaussian density, which suggests the following theorem:

'''Theorem.''' In the limit <math>n \to \infty</math>, the statistic <math>n \sum_{i=1}^k \frac{(\hat p_i - p_i)^2}{p_i} = \sum_{i=1}^k \frac{(x_i - n p_i)^2}{n p_i}</math>, where <math>\hat p_i = x_i / n</math>, [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1)</math>.
[[File:Convergence of multinomial distribution to the gaussian distribution.webm|thumb|339x339px|If we sample from the multinomial distribution <math>\mathrm{Multinomial}(n; 0.2, 0.3, 0.5)</math> and plot a heatmap of the samples within the 2-dimensional simplex (shown here as a black triangle), we see that as <math>n \to \infty</math>, the distribution converges to a Gaussian centered at the point <math>(0.2, 0.3, 0.5)</math>. The contours converge in shape to ellipses whose radii shrink as <math>1/\sqrt n</math>. Meanwhile, the separation between the discrete points shrinks as <math>1/n</math>, so the discrete multinomial distribution converges to a continuous Gaussian distribution.]]
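
The theorem can be checked numerically. The following is a minimal sketch, not a definitive implementation: it uses only the Python standard library, and the function names, sample sizes and seed are illustrative choices. It draws repeated samples from <math>\mathrm{Multinomial}(n; 0.2, 0.3, 0.5)</math>, computes the statistic from the theorem for each sample, and compares the results with two properties of <math>\chi^2(2)</math>: its mean is 2, and its tail probability is <math>e^{-t/2}</math>, so about 5% of values should exceed <math>-2 \ln 0.05 \approx 5.99</math>.

```python
import math
import random
from collections import Counter


def pearson_statistic(counts, n, p):
    """Return sum_i (x_i - n p_i)^2 / (n p_i) for observed counts x."""
    return sum((counts[i] - n * p_i) ** 2 / (n * p_i) for i, p_i in enumerate(p))


def simulate(n=500, trials=2000, p=(0.2, 0.3, 0.5), seed=0):
    """Draw `trials` multinomial samples of size n and return their statistics."""
    rng = random.Random(seed)
    categories = range(len(p))
    stats = []
    for _ in range(trials):
        # One multinomial sample: n independent categorical draws, then count.
        counts = Counter(rng.choices(categories, weights=p, k=n))
        stats.append(pearson_statistic(counts, n, p))
    return stats


stats = simulate()
mean_stat = sum(stats) / len(stats)

# chi^2 with 2 degrees of freedom has survival function exp(-t/2),
# so its 95% quantile is -2 ln(0.05).
threshold = -2 * math.log(0.05)
tail_fraction = sum(s > threshold for s in stats) / len(stats)

print("sample mean of statistic:", mean_stat, "(chi^2(2) mean is 2)")
print("fraction above 95% quantile:", tail_fraction, "(expected about 0.05)")
```

Here <math>k = 3</math>, so the comparison is with <math>\chi^2(k-1) = \chi^2(2)</math>. This case was chosen because its tail probability has a closed form, so no statistics library is needed.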

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}

The space of all distributions over categories <math>\{1, 2, \ldots, k\}</math> is a [[simplex]]: <math>\Delta_{k} = \left\{(y_1, \ldots, y_k)\colon y_1, \ldots, y_k \geq 0, \sum_i y_i = 1\right\}</math>, and the set of all possible empirical distributions after <math>n</math> experiments is a subset of the simplex: <math>\Delta_{k, n} = \left\{(x_1/n, \ldots, x_k/n)\colon x_1, \ldots, x_k \in \N, \sum_i x_i = n\right\}</math>. That is, it is the intersection between <math>\Delta_k</math> and the lattice <math>(\Z^k)/n</math>.

As <math>n</math> increases, most of the probability mass is concentrated in a subset of <math>\Delta_{k, n}</math> near <math>p</math>, and the probability distribution near <math>p</math> becomes well approximated, up to a normalizing factor, by <math display="block">\binom{n}{x_1, \cdots, x_k} p_1^{x_1} \cdots p_k^{x_k} \approx e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>From this, we see that the subset on which the mass is concentrated has radius on the order of <math>1/\sqrt n</math>, while the points in the subset are separated by distances on the order of <math>1/n</math>, so at large <math>n</math> the points merge into a continuum.
To convert this from a discrete probability distribution to a continuous probability density, we need to divide the probability of each point by the volume that the point occupies in <math>\Delta_k</math>. By symmetry, every point of <math>\Delta_{k, n}</math> occupies exactly the same volume (except for a negligible set on the boundary), so we obtain a probability density <math>\rho(\hat p) = C e^{-\frac n2 \sum_i \frac{(\hat p_i - p_i)^2}{p_i}}</math>, where <math>C</math> is a constant.
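
The quality of the Gaussian approximation near <math>p</math> can be illustrated numerically. The sketch below is only an illustration, with helper names and test points chosen here as assumptions. Because the approximation holds up to a normalizing factor, it compares the exact log-probability of a point ''relative to the point'' <math>x = np</math> with the quadratic exponent <math>-\tfrac 12 \sum_i (x_i - n p_i)^2 / (n p_i)</math>, for <math>n = 10000</math> and a point whose distance from <math>p</math> is within <math>1/\sqrt n</math>.

```python
import math


def log_multinomial_pmf(x, p):
    """Exact log-probability of counts x under Multinomial(sum(x); p)."""
    n = sum(x)
    # log of the multinomial coefficient n! / (x_1! ... x_k!) via log-gamma
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(x_i + 1) for x_i in x)
    return log_coeff + sum(x_i * math.log(p_i) for x_i, p_i in zip(x, p))


def quadratic_exponent(x, p):
    """Gaussian exponent -1/2 * sum_i (x_i - n p_i)^2 / (n p_i)."""
    n = sum(x)
    return -0.5 * sum((x_i - n * p_i) ** 2 / (n * p_i) for x_i, p_i in zip(x, p))


p = (0.2, 0.3, 0.5)
center = (2000, 3000, 5000)  # the point x = n p for n = 10000
nearby = (2030, 2980, 4990)  # empirical frequencies within 1/sqrt(n) of p

# Subtracting the value at the center removes the normalizing factor.
exact = log_multinomial_pmf(nearby, p) - log_multinomial_pmf(center, p)
approx = quadratic_exponent(nearby, p)

print("exact log-ratio:   ", exact)
print("quadratic exponent:", approx)
```

The two printed values should agree closely. The remaining difference comes from the higher-order terms of the Taylor expansion and from the slowly varying prefactor in [[Stirling's approximation]], both of which become negligible relative to the exponent as <math>n</math> grows.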

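Finally, the simplex <math>\Delta_k</math> does not fill <math>\R^k</math> but lies within a <math>(k-1)</math>-dimensional hyperplane, so the density above is a Gaussian on a <math>(k-1)</math>-dimensional space. In the rescaled coordinates <math>z_i = \sqrt n (\hat p_i - p_i) / \sqrt{p_i}</math>, it is a standard Gaussian restricted to the hyperplane <math>\sum_i \sqrt{p_i} z_i = 0</math>. Hence <math>\sum_i z_i^2 = n \sum_i \frac{(\hat p_i - p_i)^2}{p_i}</math> is the squared norm of a standard Gaussian vector in <math>k-1</math> dimensions, which has the <math>\chi^2(k-1)</math> distribution. This is the desired result.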

{{hidden end}}