==== Conditional concentration at large ''n'' ====

The above concentration phenomenon generalizes readily to the case where we condition on linear constraints. This is the theoretical justification for [[Pearson's chi-squared test]].

'''Theorem.''' Given frequencies <math>x_i\in\mathbb N</math> observed in a dataset with <math>n</math> points, we impose <math>\ell + 1</math> [[Linear independence|independent linear]] constraints
<math display="block">\begin{cases} \sum_i \hat p_i = 1, \\ \sum_i a_{1i} \hat p_i = b_1, \\ \sum_i a_{2i} \hat p_i = b_2, \\ \cdots, \\ \sum_i a_{\ell i} \hat p_i = b_{\ell} \end{cases}</math>
(notice that the first constraint is simply the requirement that the empirical distribution sums to one), such that the empirical frequencies <math>\hat p_i = x_i/n</math> satisfy all these constraints simultaneously. Let <math>q</math> denote the <math>I</math>-projection of the prior distribution <math>p</math> onto the sub-region of the simplex allowed by the linear constraints. In the <math>n \to \infty</math> limit, sampled counts <math>n \hat p_i</math> from the multinomial distribution '''conditional on''' the linear constraints are governed by
<math display="block">2n D_{KL}(\hat p \vert\vert q) \approx n \sum_i \frac{(\hat p_i - q_i)^2}{q_i},</math>
which [[converges in distribution]] to the [[chi-squared distribution]] <math>\chi^2(k-1-\ell)</math>.

{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}}
An analogous proof applies to this Diophantine problem of coupled linear equations in the count variables <math>n \hat p_i</math>,<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=2023|title=Total Empiricism: Learning from Data|eprint=2311.08315|class=math.ST}}</ref> but this time <math>\Delta_{k, n}</math> is the intersection of <math>(\Z^k)/n</math> with <math>\Delta_k</math> and <math>\ell</math> hyperplanes, all linearly independent, so the probability density <math>\rho(\hat p)</math> is restricted to a <math>(k-\ell-1)</math>-dimensional plane. In particular, expanding the KL divergence <math>D_{KL}(\hat p\vert\vert p)</math> around its minimum <math>q</math> (the <math>I</math>-projection of <math>p</math> onto <math>\Delta_{k, n}</math>) in the constrained problem ensures, by the Pythagorean theorem for <math>I</math>-divergence, that any constant and linear term in the counts <math>n \hat p_i</math> vanishes from the conditional probability of multinomially sampling those counts.

Notice that by definition, every one of <math>\hat p_1, \hat p_2, \ldots, \hat p_k</math> must be a rational number, whereas <math>p_1, p_2, \ldots, p_k</math> may be chosen from any real number in <math>[0, 1]</math> and need not satisfy the Diophantine system of equations. Only asymptotically as <math>n\rightarrow\infty</math> can the <math>\hat p_i</math> be regarded as probabilities over <math>[0, 1]</math>.
{{hidden end}}
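This statement can be checked by simulation. Below is a minimal Monte Carlo sketch (not part of the original exposition): it assumes a uniform prior <math>p</math> over <math>k = 4</math> categories and a single extra linear constraint whose target is chosen so that the <math>I</math>-projection <math>q</math> coincides with <math>p</math>; all concrete numbers are illustrative.

<syntaxhighlight lang="python">
# Monte Carlo sketch of the conditional concentration theorem.
# Assumptions (illustrative, not from the article): k = 4, n = 400,
# uniform prior p, and one linear constraint whose target equals
# sum_i a_i p_i, so the I-projection q of p is p itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n = 4, 400
p = np.full(k, 1 / k)            # prior distribution
a = np.array([0, 1, 2, 3])       # coefficients of the extra linear constraint
target = round(n * (a @ p))      # Diophantine constraint on counts: a @ x = target

# Rejection sampling: keep only multinomial draws whose counts
# satisfy the linear constraint exactly.
accepted = []
while len(accepted) < 5_000:
    draws = rng.multinomial(n, p, size=100_000)
    accepted.extend(draws[draws @ a == target])
counts = np.array(accepted)

# Pearson-type statistic  n * sum_i (p_hat_i - q_i)^2 / q_i,  with q = p here.
statistic = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)

# With k = 4 categories and ell = 1 extra constraint, the theorem predicts
# chi-squared with k - 1 - ell = 2 degrees of freedom (mean 2, variance 4).
print("empirical mean/var:", statistic.mean(), statistic.var())
print("chi2(2)  mean/var:", stats.chi2(2).mean(), stats.chi2(2).var())
</syntaxhighlight>

Dropping the extra constraint (<math>\ell = 0</math>) recovers the classical Pearson setting, where the same statistic converges to <math>\chi^2(k-1)</math>.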
Beyond exactly observed linear constraints <math>b_1,\ldots,b_\ell</math> (such as moments or prevalences), the theorem can be generalized to smooth constraints satisfied within shrinking windows:

'''Theorem.'''
* Given functions <math>f_1, \ldots, f_\ell</math> that are continuously differentiable in a neighborhood of <math>p</math>, such that the vectors <math>(1, 1, \ldots, 1), \nabla f_1(p), \ldots, \nabla f_\ell(p)</math> are linearly independent;
* given sequences <math>\epsilon_1(n), \ldots, \epsilon_\ell(n)</math>, such that asymptotically <math>\frac 1n \ll \epsilon_i(n) \ll \frac{1}{\sqrt n}</math> for each <math>i \in \{1, \ldots, \ell\}</math>;
* then for the multinomial distribution '''conditional on''' the constraints <math>f_i(\hat p) \in [f_i(p) - \epsilon_i(n), f_i(p) + \epsilon_i(n)]</math> for <math>i = 1, \ldots, \ell</math>, the quantity <math>n \sum_i \frac{(\hat p_i - p_i)^2}{p_i} = \sum_i \frac{(x_i - n p_i)^2}{n p_i}</math> converges in distribution to <math>\chi^2(k-1-\ell)</math> in the <math>n \to \infty</math> limit.

In the case that all <math>p_i</math> are equal, the theorem reduces to the concentration of entropies around the maximum entropy.<ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=April 2022|title=Categorical Distributions of Maximum Entropy under Marginal Constraints|class=hep-th|eprint=2204.03406}}</ref><ref>{{cite arXiv|last1=Loukas|first1=Orestis|last2=Chung|first2=Ho Ryun|date=June 2022|title=Entropy-based Characterization of Modeling Constraints|class=stat.ME|eprint=2206.14105}}</ref>
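The shrinking-window condition can likewise be probed numerically. The sketch below (again an illustration, not from the cited sources) uses the nonlinear statistic <math>f(\hat p) = \sum_i \hat p_i^2</math> and a window width proportional to <math>n^{-3/4}</math>, which sits between <math>1/n</math> and <math>1/\sqrt n</math> as the theorem requires; at finite <math>n</math>, the window must also be narrow relative to the natural <math>O(1/\sqrt n)</math> fluctuation of <math>f(\hat p)</math> for the loss of one degree of freedom to be visible.

<syntaxhighlight lang="python">
# Monte Carlo sketch of the generalized theorem with one smooth,
# nonlinear constraint (ell = 1). All concrete choices are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n = 4, 10_000
p = np.array([0.1, 0.2, 0.3, 0.4])   # prior giving f a non-constant gradient

def f(q):
    return (q ** 2).sum(axis=-1)     # nonlinear statistic f(p_hat) = sum_i p_hat_i^2

eps = 0.5 * n ** -0.75               # window width: 1/n << eps << 1/sqrt(n)

# Rejection sampling: keep draws with f(p_hat) within eps of f(p).
accepted = []
while len(accepted) < 5_000:
    draws = rng.multinomial(n, p, size=100_000)
    accepted.extend(draws[np.abs(f(draws / n) - f(p)) <= eps])
counts = np.array(accepted)

# The theorem predicts  n * sum_i (p_hat_i - p_i)^2 / p_i  ~  chi2(k - 1 - ell).
statistic = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)
print("empirical mean/var:", statistic.mean(), statistic.var())
print("chi2(2)  mean/var:", stats.chi2(2).mean(), stats.chi2(2).var())
</syntaxhighlight>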