=== Consistency ===
Under the conditions outlined below, the maximum likelihood estimator is [[consistent estimator|consistent]]. Consistency means that if the data were generated by <math>f(\cdot\,;\theta_0)</math> and we have a sufficiently large number of observations ''n'', then it is possible to find the value of ''θ''<sub>0</sub> with arbitrary precision. In mathematical terms this means that as ''n'' goes to infinity the estimator <math>\widehat{\theta\,}</math> [[convergence in probability|converges in probability]] to its true value: <math display="block"> \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{p}}\ \theta_0. </math> Under slightly stronger conditions, the estimator converges [[almost sure convergence|almost surely]] (or ''strongly''): <math display="block"> \widehat{\theta\,}_\mathrm{mle}\ \xrightarrow{\text{a.s.}}\ \theta_0. </math>

In practical applications, data are never generated by <math>f(\cdot\,;\theta_0)</math>. Rather, <math>f(\cdot\,;\theta_0)</math> is a model, often in idealized form, of the process that generated the data. It is a common aphorism in statistics that ''[[all models are wrong]]''. Thus, true consistency does not occur in practical applications. Nevertheless, consistency is often considered to be a desirable property for an estimator to have.

To establish consistency, the following conditions are sufficient.<ref>By Theorem 2.5 in {{cite book | last1 = Newey | first1 = Whitney K. | last2 = McFadden | first2 = Daniel | author-link2 = Daniel McFadden | chapter = Chapter 36: Large sample estimation and hypothesis testing | editor1-first= Robert | editor1-last=Engle | editor2-first=Dan | editor2-last=McFadden | title = Handbook of Econometrics, Vol.4 | year = 1994 | publisher = Elsevier Science | pages = 2111–2245 | isbn=978-0-444-88766-5 }}</ref>
{{ordered list
|1= [[Identifiability|Identification]] of the model: <math display="block"> \theta \neq \theta_0 \quad \Leftrightarrow \quad f(\cdot\mid\theta)\neq f(\cdot\mid\theta_0). </math> In other words, different parameter values ''θ'' correspond to different distributions within the model. If this condition did not hold, there would be some value ''θ''<sub>1</sub> such that ''θ''<sub>0</sub> and ''θ''<sub>1</sub> generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data—these parameters would have been [[observational equivalence|observationally equivalent]]. <br /> The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function ''ℓ''(''θ''{{!}}·) has a unique global maximum at ''θ''<sub>0</sub>.
|2= Compactness: the parameter space Θ of the model is [[compact set|compact]]. [[File:Ee noncompactness.svg|240px|right]] The identification condition establishes that the log-likelihood has a unique global maximum. Compactness ensures that the likelihood cannot come arbitrarily close to the maximum value at some other point (as demonstrated for example in the picture on the right). Compactness is only a sufficient condition, not a necessary one.
Compactness can be replaced by some other conditions, such as: {{unordered list | both [[Concave function|concavity]] of the log-likelihood function and compactness of some (nonempty) upper [[level set]]s of the log-likelihood function, or | existence of a compact [[Neighbourhood (mathematics)|neighborhood]] {{mvar|N}} of {{mvar|θ}}<sub>0</sub> such that outside of {{mvar|N}} the log-likelihood function is less than the maximum by at least some {{nowrap|{{mvar|ε}} > 0}}. }}
|3= Continuity: the function {{math|ln ''f''(''x'' {{!}} ''θ'')}} is continuous in {{mvar|θ}} for almost all values of {{mvar|x}}: <math display="block"> \operatorname{\mathbb P} \Bigl[\; \ln f(x\mid\theta) \;\in\; C^0(\Theta) \;\Bigr] = 1. </math> The continuity here can be replaced with a slightly weaker condition of [[upper semi-continuous|upper semi-continuity]].
|4= Dominance: there exists {{math|''D''(''x'')}} integrable with respect to the distribution {{math|''f''(''x'' {{!}} ''θ''<sub>0</sub>)}} such that <math display="block"> \Bigl|\ln f(x\mid\theta)\Bigr| < D(x) \quad \text{ for all } \theta\in\Theta. </math> By the [[uniform law of large numbers]], the dominance condition together with continuity establishes the uniform convergence in probability of the log-likelihood: <math display="block"> \sup_{\theta\in\Theta} \left|\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\,\right|\ \xrightarrow{\text{p}}\ 0. </math>
}}

The dominance condition can be employed in the case of [[i.i.d.]] observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence <math>\widehat{\ell\,}(\theta\mid x)</math> is [[stochastic equicontinuity|stochastically equicontinuous]]. If one wants to demonstrate that the ML estimator <math>\widehat{\theta\,}</math> converges to ''θ''<sub>0</sub> [[almost sure convergence|almost surely]], then a stronger condition of uniform convergence almost surely has to be imposed: <math display="block"> \sup_{\theta\in\Theta} \left|\;\widehat{\ell\,}(\theta\mid x) - \ell(\theta)\;\right| \ \xrightarrow{\text{a.s.}}\ 0. </math>

Additionally, if (as assumed above) the data were generated by <math>f(\cdot\,;\theta_0)</math>, then under certain conditions, it can also be shown that the maximum likelihood estimator [[Convergence in distribution|converges in distribution]] to a normal distribution. Specifically,<ref name=":1">By Theorem 3.3 in {{cite book | last1 = Newey | first1 = Whitney K. | last2 = McFadden | first2 = Daniel | author-link2 = Daniel McFadden | chapter = Chapter 36: Large sample estimation and hypothesis testing | editor1-first= Robert | editor1-last=Engle | editor2-first=Dan | editor2-last=McFadden | title = Handbook of Econometrics, Vol.4 | year = 1994 | publisher = Elsevier Science | pages = 2111–2245 | isbn=978-0-444-88766-5 }}</ref> <math display="block"> \sqrt{n} \left(\widehat{\theta\,}_\mathrm{mle} - \theta_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\, I^{-1}\right) </math> where {{math|''I''}} is the [[Fisher information|Fisher information matrix]].
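Both asymptotic properties above (consistency and asymptotic normality) can be illustrated by simulation. The following Python sketch is only a rough illustration, not part of the cited results: the exponential model, the true rate ''θ''<sub>0</sub> = 2, the sample sizes, and the number of replications are arbitrary choices made for this example. For an exponential distribution with rate ''θ'' the MLE has the closed form <math>\widehat{\theta\,} = 1/\bar{x}</math> and the per-observation Fisher information is <math>I(\theta) = 1/\theta^2</math>, so the sketch checks that the estimate approaches ''θ''<sub>0</sub> as ''n'' grows and that <math>\sqrt{n}\,(\widehat{\theta\,} - \theta_0)</math> has variance close to <math>1/I(\theta_0) = \theta_0^2</math>.

<syntaxhighlight lang="python">
# Simulation sketch (illustrative assumptions: exponential model, true rate
# theta0 = 2; sample sizes and replication counts chosen arbitrarily).
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                     # true rate parameter
fisher_info = 1.0 / theta0**2    # per-observation Fisher information I(theta0)

# Consistency: the MLE 1 / (sample mean) approaches theta0 as n grows.
for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    theta_hat = 1.0 / x.mean()   # closed-form MLE of the exponential rate
    print(f"n = {n:>9}: theta_hat = {theta_hat:.4f}")

# Asymptotic normality: sqrt(n) * (theta_hat - theta0) is approximately
# N(0, 1 / I(theta0)) when n is large.
n, replications = 1_000, 5_000
samples = rng.exponential(scale=1.0 / theta0, size=(replications, n))
theta_hats = 1.0 / samples.mean(axis=1)
z = np.sqrt(n) * (theta_hats - theta0)
print("empirical variance:", z.var(), "  theoretical 1/I(theta0):", 1.0 / fisher_info)
</syntaxhighlight>

With increasing ''n'' the printed estimates should stabilise near ''θ''<sub>0</sub>, and the empirical variance of the normalised errors should be close to <math>\theta_0^2 = 4</math>, in line with the normal limit stated above.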