
=== Relation to minimizing Kullback–Leibler divergence and cross entropy ===
Finding the <math>\hat \theta</math> that maximizes the likelihood is asymptotically equivalent to finding the <math>\hat \theta</math> whose probability distribution <math>Q_{\hat \theta}</math> has minimal [[Kullback–Leibler divergence]] from the true probability distribution <math>P_{\theta_0}</math> that generated the data.<ref>cmplx96 (https://stats.stackexchange.com/users/177679/cmplx96), Kullback–Leibler divergence, URL (version: 2017-11-18): https://stats.stackexchange.com/q/314472 (in the YouTube video, see minutes 13 to 25)</ref> Ideally, P and Q belong to the same family, and the only unknown is the <math>\theta</math> that defines P. Even if they do not, so that the model is misspecified, the MLE still gives the "closest" distribution to the true distribution <math>P_{\theta_0}</math>, within the family of models Q indexed by <math>\theta</math>.<ref>[https://web.stanford.edu/class/stats200/Lecture16.pdf Introduction to Statistical Inference | Stanford (Lecture 16: MLE under model misspecification)]</ref>
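
The misspecified case can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not part of the cited sources: the data are drawn from a gamma distribution with shape 2 and scale 1, while the fitted model family is exponential with an unknown rate. All names (<code>rate_mle</code>, <code>rate_kl</code>, <code>cross_entropy</code>) are hypothetical. Because the entropy of the true distribution does not depend on the rate, minimizing the KL divergence over the rate is the same as minimizing the cross entropy, which has a closed form here.

```python
import math
import random

random.seed(1)

# Assumed true distribution (illustrative): Gamma(shape=2, scale=1).
# random.gammavariate(alpha, beta) uses beta as the scale, so the mean is alpha * beta.
SHAPE, SCALE = 2.0, 1.0
n = 200_000
ys = [random.gammavariate(SHAPE, SCALE) for _ in range(n)]

# Misspecified model family: Exponential(rate), density rate * exp(-rate * y).
# Its maximum likelihood estimate has the closed form 1 / (sample mean).
rate_mle = n / sum(ys)

# Cross entropy of Exponential(rate) relative to the true distribution:
# E[-log q(y)] = -log(rate) + rate * E[y], with E[y] = SHAPE * SCALE.
true_mean = SHAPE * SCALE

def cross_entropy(rate):
    return -math.log(rate) + rate * true_mean

# Grid search for the rate that minimizes the cross entropy (and hence the KL divergence).
grid = [i / 1000 for i in range(1, 3001)]
rate_kl = min(grid, key=cross_entropy)

print("MLE of the rate:        ", rate_mle)
print("KL-minimizing rate:     ", rate_kl)
```

With a large sample, the two printed rates should agree closely, even though no exponential distribution matches the gamma distribution that generated the data.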

{| role="presentation" class="wikitable mw-collapsible mw-collapsed"
| '''Proof.'''
|-
|
For simplicity of notation, assume that P = Q. Let <math>\mathbf{y} = (y_1, y_2, \ldots, y_n)</math> be ''n'' [[i.i.d]] data samples drawn from some distribution, <math>y \sim P_{\theta_0}</math>, which we try to estimate by finding the <math>\hat \theta</math> that maximizes the likelihood under <math>P_{\theta}</math>. Then:
<math display="block">\begin{align}
\hat \theta &= \underset{\theta}{\operatorname{arg\,max}}\, L_{P_{\theta}}(\mathbf{y}) = \underset{\theta}{\operatorname{arg\,max}}\, P_{\theta} (\mathbf{y}) = \underset{\theta}{\operatorname{arg\,max}}\, P (\mathbf{y} \mid  \theta)\\
 &= \underset{\theta}{\operatorname{arg\,max}}\, \prod_{i=1}^n P (y_i \mid \theta) = \underset{\theta}{\operatorname{arg\,max}}\, \sum_{i=1}^n \log P (y_i \mid \theta) \\
 &= \underset{\theta}{\operatorname{arg\,max}}\, \left( \sum_{i=1}^n \log P (y_i \mid \theta) - \sum_{i=1}^n \log P (y_i \mid \theta_0) \right) = \underset{\theta}{\operatorname{arg\,max}}\,  \sum_{i=1}^n \left( \log P (y_i \mid \theta) - \log P (y_i \mid \theta_0) \right) \\
 &=  \underset{\theta}{\operatorname{arg\,max}}\,  \sum_{i=1}^n \log\frac{P (y_i \mid  \theta)}{P (y_i \mid  \theta_0)} = \underset{\theta}{\operatorname{arg\,min}}\,  \sum_{i=1}^n \log \frac{P (y_i \mid  \theta_0)}{P (y_i \mid  \theta)}
  = \underset{\theta}{\operatorname{arg\,min}}\,  \frac{1}{n} \sum_{i=1}^n  \log \frac{P (y_i \mid  \theta_0)}{P (y_i \mid  \theta)} \\
 &= \underset{\theta}{\operatorname{arg\,min}}\,  \frac{1}{n} \sum_{i=1}^n  h_{\theta}(y_i)  \quad \underset{n\to\infty}{\longrightarrow} \quad \underset{\theta}{\operatorname{arg\,min}}\,  E [ h_{\theta}(y) ]  \\
 &=\underset{\theta}{\operatorname{arg\,min}}\,  \int  P_{\theta_0}(y) h_\theta(y) dy  =  \underset{\theta}{\operatorname{arg\,min}}\,  \int  P_{\theta_0}(y) \log \frac{P (y \mid  \theta_0)}{P (y \mid  \theta)}  dy\\
 &= \underset{\theta}{\operatorname{arg\,min}}\,  D_\text{KL}(P_{\theta_0} \parallel P_{\theta}) 
\end{align}</math>

where <math>h_{\theta}(y) = \log \frac{P (y \mid \theta_0)}{P (y \mid \theta)}</math>. Writing the summand as ''h'' makes it clear how the [[law of large numbers]] takes the sample average of <math>h_{\theta}(y_i)</math> to its [[Expected value|expectation]], which is then written as an integral using the [[law of the unconscious statistician]]. The first several transitions use the laws of [[logarithm]]s and the fact that the <math>\hat \theta</math> that maximizes a function also maximizes any strictly increasing transformation of that function (e.g., taking the logarithm, adding a constant, or multiplying by a positive constant).

Since [[Kullback–Leibler divergence#Cross entropy|cross entropy]] is the sum of [[Entropy (information theory)|Shannon's entropy]] and the KL divergence, and since the entropy of <math>P_{\theta_0}</math> does not depend on <math>\theta</math>, the MLE also asymptotically minimizes cross entropy.<ref>Sycorax says Reinstate Monica (https://stats.stackexchange.com/users/22311/sycorax-says-reinstate-monica), the relationship between maximizing the likelihood and minimizing the cross-entropy, URL (version: 2019-11-06): https://stats.stackexchange.com/q/364237</ref>
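
In the notation of the derivation above, the decomposition can be written out as follows (the symbol ''H'', used here for both entropy and cross entropy, is introduced only for illustration):
<math display="block">\begin{align}
H(P_{\theta_0}, P_{\theta}) &= -\int P_{\theta_0}(y) \log P (y \mid \theta) \, dy \\
 &= -\int P_{\theta_0}(y) \log P (y \mid \theta_0) \, dy + \int P_{\theta_0}(y) \log \frac{P (y \mid \theta_0)}{P (y \mid \theta)} \, dy \\
 &= H(P_{\theta_0}) + D_\text{KL}(P_{\theta_0} \parallel P_{\theta})
\end{align}</math>
Only the second term depends on <math>\theta</math>, so both quantities are minimized by the same <math>\theta</math>.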
|}
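
The limiting step of the proof can be checked numerically. The following is a minimal sketch, assuming a correctly specified [[Bernoulli distribution|Bernoulli]] model with true parameter 0.3 (an illustrative choice); the helper names <code>bernoulli_kl</code>, <code>avg_log_ratio</code> and <code>log_likelihood</code> are hypothetical. It compares the sample average of the log-likelihood ratio with the closed-form KL divergence, and the grid maximizer of the likelihood with the grid minimizer of the KL divergence.

```python
import math
import random

def bernoulli_kl(p0, p):
    """D_KL(Bernoulli(p0) || Bernoulli(p)) in nats, for p0 and p strictly inside (0, 1)."""
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

def avg_log_ratio(ys, p0, p):
    """Sample mean of log P(y | p0) - log P(y | p) over the observations ys."""
    def log_pmf(y, q):
        return math.log(q) if y == 1 else math.log(1 - q)
    return sum(log_pmf(y, p0) - log_pmf(y, p) for y in ys) / len(ys)

random.seed(0)
theta0 = 0.3          # assumed true parameter (illustrative)
n = 100_000
ys = [1 if random.random() < theta0 else 0 for _ in range(n)]
k = sum(ys)           # number of ones in the sample

def log_likelihood(p):
    """Bernoulli log-likelihood of the whole sample at parameter p."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Candidate parameters strictly inside (0, 1), so every logarithm is defined.
grid = [i / 1000 for i in range(1, 1000)]

theta_mle = max(grid, key=log_likelihood)                     # maximizes the likelihood
theta_kl = min(grid, key=lambda p: bernoulli_kl(theta0, p))   # minimizes the KL divergence

print("grid MLE:              ", theta_mle)
print("grid KL minimizer:     ", theta_kl)
print("average log ratio @0.5:", avg_log_ratio(ys, theta0, 0.5))
print("exact KL @0.5:         ", bernoulli_kl(theta0, 0.5))
```

For large ''n'', the average log ratio should be close to the exact KL divergence at any fixed parameter value, and the two grid optimizers should nearly coincide.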