
==== Asymptotics ====

By [[Stirling's formula]], in the limit <math>n, x_1, \ldots, x_k \to \infty</math>, we have <math display="block">\ln \binom{n}{x_1, \ldots, x_k} + \sum_{i=1}^k x_i\ln p_i = -n D_{KL}(\hat p \| p) - \frac{k-1}{2} \ln(2\pi n) - \frac 12 \sum_{i=1}^k \ln(\hat p_i)  + o(1)</math> where the relative frequencies <math>\hat p_i = x_i/n</math> in the data can be interpreted as probabilities from the empirical distribution <math>\hat p</math>, and <math>D_{KL}</math> is the [[Kullback–Leibler divergence]]. The left-hand side is the logarithm of the multinomial probability of observing the counts <math>x_1, \ldots, x_k</math>.
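
The asymptotic formula can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names and the example values of <math>p</math> and <math>x</math> are illustrative assumptions, and the exact log-probability is evaluated with the log-gamma function from the Python standard library.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) in nats, using the convention 0 * ln(0 / p_i) = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def log_pmf_exact(x, p):
    """Exact log-probability of the count vector x under Multinomial(n, p)."""
    n = sum(x)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(xi + 1) for xi in x)
    return log_coeff + sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)

def log_pmf_asymptotic(x, p):
    """Right-hand side of the asymptotic formula without the o(1) term.

    Requires every count x_i to be positive, because of the ln(p_hat_i) terms.
    """
    n = sum(x)
    k = len(x)
    p_hat = [xi / n for xi in x]
    return (-n * kl_divergence(p_hat, p)
            - (k - 1) / 2 * math.log(2 * math.pi * n)
            - 0.5 * sum(math.log(q) for q in p_hat))

# Illustrative values (assumptions, not taken from the article).
p = [0.25, 0.25, 0.5]
x = [200, 300, 500]
exact = log_pmf_exact(x, p)
approx = log_pmf_asymptotic(x, p)
print(exact, approx, exact - approx)
```

For these values the two quantities should differ only by a small remainder, which shrinks further when all counts are scaled up by a common factor.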

This formula can be interpreted as follows.

Consider <math>\Delta_k</math>, the space of all possible distributions over the categories <math>\{1, 2, ..., k\}</math>. It is a [[simplex]]. After <math>n</math> independent samples from the categorical distribution <math>p</math> (which is how we construct the multinomial distribution), we obtain an empirical distribution <math>\hat p</math>.

By the asymptotic formula, the probability of observing a given empirical distribution <math>\hat p</math> that deviates from the actual distribution <math>p</math> decays exponentially in <math>n</math>: to leading order it behaves like <math>e^{-n D_{KL}(\hat p \| p)}</math>, so the decay rate is <math>D_{KL}(\hat p \| p)</math>. The larger the number of samples and the further <math>\hat p</math> is from <math>p</math>, as measured by the Kullback–Leibler divergence, the less likely it is to see such an empirical distribution.
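
The exponential decay can be illustrated by holding the empirical distribution fixed and increasing <math>n</math>. The sketch below is only an illustration under assumed values: the base counts <code>(2, 3, 5)</code>, the distribution <code>p</code>, and the function names are hypothetical choices. It prints <math>-\tfrac 1n \ln \Pr(\hat p)</math> next to <math>D_{KL}(\hat p \| p)</math>.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) in nats, using the convention 0 * ln(0 / p_i) = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def log_pmf_exact(x, p):
    """Exact log-probability of the count vector x under Multinomial(n, p)."""
    n = sum(x)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(xi + 1) for xi in x)
    return log_coeff + sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)

def decay_rate(scale, base_counts, p):
    """Return -(1/n) ln Pr(counts), where counts = scale * base_counts."""
    x = [scale * c for c in base_counts]
    return -log_pmf_exact(x, p) / sum(x)

# Illustrative values (assumptions): p_hat = (0.2, 0.3, 0.5) for every scale.
p = [0.25, 0.25, 0.5]
base_counts = [2, 3, 5]
p_hat = [c / sum(base_counts) for c in base_counts]
target = kl_divergence(p_hat, p)

for scale in (1, 10, 100, 1000):
    n = scale * sum(base_counts)
    print(n, decay_rate(scale, base_counts, p), target)
```

The gap between the two printed quantities comes from the sub-exponential terms in the asymptotic formula, which are of order <math>(\ln n)/n</math> after dividing by <math>n</math>.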

If <math>A</math> is a closed subset of <math>\Delta_k</math> that is the closure of its interior, then by dividing up <math>A</math> into pieces, and reasoning about the growth rate of <math>\Pr(\hat p \in A_\epsilon)</math> on each piece <math>A_\epsilon</math>, we obtain [[Sanov's theorem]], which states that <math display="block">\lim_{n \to \infty} \frac 1n \ln \Pr(\hat p \in A)  = - \inf_{q \in A} D_{KL}(q \| p)</math> For a general closed set <math>A</math>, only the upper bound holds, with the limit replaced by a limit superior.
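
The limit in Sanov's theorem can be illustrated in the simplest case <math>k = 2</math>. The sketch below makes several assumptions that are not part of the article: it takes <math>p = (0.3, 0.7)</math> and the set <math>A = \{q : q_1 \ge 0.5\}</math>, for which the infimum of <math>D_{KL}(q \| p)</math> is attained at the boundary point <math>q = (0.5, 0.5)</math>, and the function names are hypothetical. The probability <math>\Pr(\hat p \in A)</math> is summed exactly in log space.

```python
import math

def log_binom_pmf(n, j, p1):
    """ln Pr(x_1 = j) for n draws with first-category probability p1."""
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p1) + (n - j) * math.log(1 - p1))

def log_tail_prob(n, a, p1):
    """ln Pr(p_hat_1 >= a), computed by log-sum-exp over the exact pmf."""
    j_min = math.ceil(a * n)
    logs = [log_binom_pmf(n, j, p1) for j in range(j_min, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

def kl_two_point(a, p1):
    """D_KL((a, 1 - a) || (p1, 1 - p1)) in nats, for 0 < a < 1."""
    return a * math.log(a / p1) + (1 - a) * math.log((1 - a) / (1 - p1))

# Illustrative values (assumptions): p = (0.3, 0.7), A = {q : q_1 >= 0.5}.
p1, a = 0.3, 0.5
rate = kl_two_point(a, p1)  # infimum of D_KL over A, since a > p1

for n in (100, 1000, 5000):
    print(n, log_tail_prob(n, a, p1) / n, -rate)
```

As <math>n</math> grows, the first printed quantity should approach the second, with a gap of order <math>(\ln n)/n</math>.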