== Principles ==
We model a set of observations as a random [[Sample (statistics)|sample]] from an unknown [[joint probability distribution]] which is expressed in terms of a set of [[statistical parameters|parameters]]. The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector <math>\; \theta = \left[ \theta_{1},\, \theta_2,\, \ldots,\, \theta_k \right]^{\mathsf{T}} \;</math> so that this distribution falls within a [[parametric family]] <math>\; \{ f(\cdot\,;\theta) \mid \theta \in \Theta \} \;,</math> where <math>\, \Theta \,</math> is called the ''[[parameter space]]'', a finite-dimensional subset of [[Euclidean space]]. Evaluating the joint density at the observed data sample <math>\; \mathbf{y} = (y_1, y_2, \ldots, y_n) \;</math> gives a real-valued function,
<math display="block">\mathcal{L}_{n}(\theta) = \mathcal{L}_{n}(\theta; \mathbf{y}) = f_{n}(\mathbf{y}; \theta) \;,</math>
which is called the [[likelihood function]]. For [[Independence (probability theory)|independent random variables]], <math>f_{n}(\mathbf{y}; \theta)</math> will be the product of univariate [[Probability density function|density functions]]:
<math display="block">f_{n}(\mathbf{y}; \theta) = \prod_{k=1}^n \, f_k^\mathsf{univar}(y_k; \theta) ~.</math>

Maximum likelihood estimation finds the values of the model parameters that maximize the likelihood function over the parameter space,<ref name=":0">{{cite journal |last=Myung |first=I.J. |year=2003 |title=Tutorial on maximum likelihood estimation |journal=[[Journal of Mathematical Psychology]] |volume=47 |issue=1 |pages=90–100 |doi=10.1016/S0022-2496(02)00028-7 }}</ref> that is:
<math display="block"> \hat{\theta} = \underset{\theta\in\Theta}{\operatorname{arg\;max}}\,\mathcal{L}_{n}(\theta\,;\mathbf{y}) ~. </math>
Intuitively, this selects the parameter values that make the observed data most probable. The specific value <math>~ \hat{\theta} = \hat{\theta}_{n}(\mathbf{y}) \in \Theta ~</math> that maximizes the likelihood function <math>\, \mathcal{L}_{n} \,</math> is called the maximum likelihood estimate. Further, if the function <math>\; \hat{\theta}_{n} : \mathbb{R}^{n} \to \Theta \;</math> so defined is [[measurable function|measurable]], then it is called the maximum likelihood [[estimator]]. It is generally a function defined over the [[sample space]], i.e. taking a given sample as its argument. A [[Necessity and sufficiency|sufficient but not necessary]] condition for its existence is for the likelihood function to be [[Continuous function|continuous]] over a parameter space <math>\, \Theta \,</math> that is [[Compact space|compact]].<ref>{{cite book |first1=Christian |last1=Gourieroux |first2=Alain |last2=Monfort |year=1995 |title=Statistics and Econometric Models |publisher=Cambridge University Press |isbn=0-521-40551-3 |page=[https://archive.org/details/statisticseconom00gour_434/page/n172 161] |url=https://archive.org/details/statisticseconom00gour_434 |url-access=limited}}</ref> For an [[Open set|open]] <math>\, \Theta \,</math> the likelihood function may increase without ever reaching a supremum value.
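As a concrete illustration of this maximization, the following minimal Python sketch (added for exposition; the simulated data and solver settings are assumptions, not part of the cited sources) finds the maximizer of the likelihood of an i.i.d. exponential model numerically and compares it with the analytic solution <math>\hat\lambda = 1/\bar{y}\,</math>.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=25)   # assumed example data (true rate 0.5)

def likelihood(lam, y):
    """Joint density of an i.i.d. exponential sample evaluated at the data."""
    return np.prod(lam * np.exp(-lam * y))

# The optimizer minimizes, so pass the negative likelihood; the parameter
# space (0, infinity) is approximated here by a bounded search interval.
res = minimize_scalar(lambda lam: -likelihood(lam, y),
                      bounds=(1e-6, 10.0), method="bounded")

print("numerical MLE:", res.x)            # argmax of the likelihood
print("analytic MLE 1/mean(y):", 1.0 / y.mean())
# For large samples the raw product underflows, which is one reason the
# log-likelihood introduced below is preferred in practice.
</syntaxhighlight>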
In practice, it is often convenient to work with the [[natural logarithm]] of the likelihood function, called the [[log-likelihood]]:
<math display="block"> \ell(\theta\,;\mathbf{y}) = \ln \mathcal{L}_{n}(\theta\,;\mathbf{y}) ~. </math>
Since the logarithm is a [[monotonic function]], the maximum of <math>\; \ell(\theta\,;\mathbf{y}) \;</math> occurs at the same value of <math>\theta</math> as does the maximum of <math>\, \mathcal{L}_{n} ~.</math><ref>{{cite book |first=Edward J. |last=Kane |year=1968 |title=Economic Statistics and Econometrics |location=New York, NY |publisher=Harper & Row |page=[https://archive.org/details/economicstatisti00kane/page/n200 179] |url=https://archive.org/details/economicstatisti00kane |url-access=registration}}</ref> If <math>\ell(\theta\,;\mathbf{y})</math> is [[Differentiable function|differentiable]] in <math>\, \Theta \,,</math> [[Derivative test|necessary conditions]] for the occurrence of a maximum (or a minimum) are
<math display="block">\frac{\partial \ell}{\partial \theta_{1}} = 0, \quad \frac{\partial \ell}{\partial \theta_{2}} = 0, \quad \ldots, \quad \frac{\partial \ell}{\partial \theta_{k}} = 0 ~,</math>
known as the likelihood equations. For some models, these equations can be explicitly solved for <math>\, \widehat{\theta\,} \,,</math> but in general no closed-form solution to the maximization problem is known or available, and an MLE can only be found via [[Mathematical optimization|numerical optimization]]. Another problem is that in finite samples, there may exist multiple [[Zero of a function|roots]] for the likelihood equations.<ref>{{cite book |first1=Christopher G. |last1=Small |first2=Jinfang |last2=Wang |year=2003 |chapter=Working with roots |title=Numerical Methods for Nonlinear Estimating Equations |publisher=Oxford University Press |isbn=0-19-850688-0 |pages=74–124 |chapter-url=https://books.google.com/books?id=hMrwQVllY5AC&pg=PA74 }}</ref> Whether the identified root <math>\, \widehat{\theta\,} \,</math> of the likelihood equations is indeed a (local) maximum depends on whether the matrix of second-order partial and cross-partial derivatives, the so-called [[Hessian matrix]]
<math display="block">\mathbf{H}\left(\widehat{\theta\,}\right) = \begin{bmatrix} \left. \frac{\partial^2 \ell}{\partial \theta_1^2} \right|_{\theta=\widehat{\theta\,}} & \left. \frac{\partial^2 \ell}{\partial \theta_1 \, \partial \theta_2} \right|_{\theta=\widehat{\theta\,}} & \dots & \left. \frac{\partial^2 \ell}{\partial \theta_1 \, \partial \theta_k} \right|_{\theta=\widehat{\theta\,}} \\ \left. \frac{\partial^2 \ell}{\partial \theta_2 \, \partial \theta_1} \right|_{\theta=\widehat{\theta\,}} & \left. \frac{\partial^2 \ell}{\partial \theta_2^2} \right|_{\theta=\widehat{\theta\,}} & \dots & \left. \frac{\partial^2 \ell}{\partial \theta_2 \, \partial \theta_k} \right|_{\theta=\widehat{\theta\,}} \\ \vdots & \vdots & \ddots & \vdots \\ \left. \frac{\partial^2 \ell}{\partial \theta_k \, \partial \theta_1} \right|_{\theta=\widehat{\theta\,}} & \left. \frac{\partial^2 \ell}{\partial \theta_k \, \partial \theta_2} \right|_{\theta=\widehat{\theta\,}} & \dots & \left. \frac{\partial^2 \ell}{\partial \theta_k^2} \right|_{\theta=\widehat{\theta\,}} \end{bmatrix} ~,</math>
is [[negative semi-definite]] at <math>\widehat{\theta\,}</math>, as this indicates local [[Concave function|concavity]]. Conveniently, most common [[probability distribution]]s – in particular the [[exponential family]] – are [[Logarithmically concave function|logarithmically concave]].<ref>{{cite book |first1=Robert E. |last1=Kass |first2=Paul W. |last2=Vos |year=1997 |title=Geometrical Foundations of Asymptotic Inference |page=14 |location=New York, NY |publisher=John Wiley & Sons |isbn=0-471-82668-5 |url=https://books.google.com/books?id=e43EAIfUPCwC&pg=PA14 }}</ref><ref>{{cite web |first=Alecos |last=Papadopoulos |date=25 September 2013 |title=Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)? |website=[[Stack Exchange]] |url=https://stats.stackexchange.com/q/70975 }}</ref>
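The same numerical treatment applies when the likelihood equations are inconvenient to solve directly. The sketch below (an added illustration; the normal model, the simulated data, and the finite-difference helper are assumptions for the example) maximizes the log-likelihood of a normal model numerically, compares the result with the closed-form solution of the likelihood equations, and checks local concavity by verifying that a numerically approximated Hessian has only negative eigenvalues.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)        # assumed example data

def loglik(theta, y):
    """Log-likelihood of an i.i.d. normal sample with parameters (mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        return -np.inf
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - (y - mu) ** 2 / (2 * sigma ** 2))

# Numerical maximization (minimize the negative log-likelihood).
res = minimize(lambda t: -loglik(t, y), x0=[0.0, 1.0], method="Nelder-Mead")
print("numerical MLE:", res.x)
print("closed-form MLE:", [y.mean(), y.std(ddof=0)])  # solution of the likelihood equations

def hessian(f, x, eps=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * eps, np.eye(k)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian(lambda t: loglik(t, y), res.x)
print("Hessian eigenvalues:", np.linalg.eigvalsh(H))  # all negative at a local maximum
</syntaxhighlight>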
=== Restricted parameter space ===
{{Distinguish|restricted maximum likelihood}}
While the domain of the likelihood function – the [[parameter space]] – is generally a finite-dimensional subset of [[Euclidean space]], additional [[Restriction (mathematics)|restriction]]s sometimes need to be incorporated into the estimation process. The parameter space can be expressed as
<math display="block">\Theta = \left\{ \theta : \theta \in \mathbb{R}^{k},\; h(\theta) = 0 \right\} ~,</math>
where <math>\; h(\theta) = \left[ h_{1}(\theta), h_{2}(\theta), \ldots, h_{r}(\theta) \right] \;</math> is a [[vector-valued function]] mapping <math>\, \mathbb{R}^{k} \,</math> into <math>\; \mathbb{R}^{r} ~.</math> Estimating the true parameter <math>\theta</math> belonging to <math>\Theta</math> then, as a practical matter, means finding the maximum of the likelihood function subject to the [[Constraint (mathematics)|constraint]] <math>~h(\theta) = 0 ~.</math>

Theoretically, the most natural approach to this [[constrained optimization]] problem is the method of substitution, that is "filling out" the restrictions <math>\; h_{1}, h_{2}, \ldots, h_{r} \;</math> to a set <math>\; h_{1}, h_{2}, \ldots, h_{r}, h_{r+1}, \ldots, h_{k} \;</math> in such a way that <math>\; h^{\ast} = \left[ h_{1}, h_{2}, \ldots, h_{k} \right] \;</math> is a [[one-to-one function]] from <math>\mathbb{R}^{k}</math> to itself, and reparameterizing the likelihood function by setting <math>\; \phi_{i} = h_{i}(\theta_{1}, \theta_{2}, \ldots, \theta_{k}) ~.</math><ref name="Silvey p79">{{cite book |first=S. D. |last=Silvey |year=1975 |title=Statistical Inference |location=London, UK |publisher=Chapman and Hall |isbn=0-412-13820-4 |page=79 |url=https://books.google.com/books?id=qIKLejbVMf4C&pg=PA79 }}</ref> Because of the equivariance of the maximum likelihood estimator, the properties of the MLE apply to the restricted estimates also.<ref>{{cite web |first=David |last=Olive |year=2004 |title=Does the MLE maximize the likelihood? |website=Southern Illinois University |url=http://lagrange.math.siu.edu/Olive/simle.pdf }}</ref> For instance, in a [[multivariate normal distribution]] the [[covariance matrix]] <math>\, \Sigma \,</math> must be [[Positive-definite matrix|positive-definite]]; this restriction can be imposed by replacing <math>\; \Sigma = \Gamma^{\mathsf{T}} \Gamma \;,</math> where <math>\Gamma</math> is a real [[upper triangular matrix]] and <math>\Gamma^{\mathsf{T}}</math> is its [[transpose]].<ref>{{cite journal |first=Daniel P. |last=Schwallie |year=1985 |title=Positive definite maximum likelihood covariance estimators |journal=Economics Letters |volume=17 |issue=1–2 |pages=115–117 |doi=10.1016/0165-1765(85)90139-9 }}</ref>
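For the covariance example just given, the reparameterization <math>\Sigma = \Gamma^{\mathsf{T}} \Gamma</math> can be sketched as follows (an added illustration; the simulated data, the helper <code>gamma_from</code>, and the solver settings are assumptions rather than a prescribed implementation). The free entries of the upper triangular <math>\Gamma</math> are optimized without constraints, and the resulting <math>\widehat\Sigma</math> is compared with the closed-form MLE, the <math>1/n</math>-scaled sample covariance of the centered data.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, n = 3, 500
A = rng.normal(size=(d, d))
Y = rng.normal(size=(n, d)) @ A            # assumed example data
Yc = Y - Y.mean(axis=0)                    # center at the MLE of the mean

def gamma_from(v):
    """Fill an upper triangular Gamma with the entries of the unconstrained vector v."""
    G = np.zeros((d, d))
    G[np.triu_indices(d)] = v
    return G

def negloglik(v):
    G = gamma_from(v)
    Sigma = G.T @ G                        # positive semi-definite by construction
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:                          # guard against a singular Sigma
        return np.inf
    quad = np.einsum('ij,jk,ik->', Yc, np.linalg.inv(Sigma), Yc)
    return 0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad)

v0 = np.eye(d)[np.triu_indices(d)]         # start from Gamma = identity
res = minimize(negloglik, v0, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})
Sigma_hat = gamma_from(res.x).T @ gamma_from(res.x)

print(np.round(Sigma_hat, 3))              # restricted (positive semi-definite) MLE
print(np.round(Yc.T @ Yc / n, 3))          # closed-form MLE for comparison
</syntaxhighlight>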
In practice, restrictions are usually imposed using the method of Lagrange multipliers which, given the constraints as defined above, leads to the ''restricted likelihood equations''
<math display="block">\frac{\partial \ell}{\partial \theta} - \frac{\partial h(\theta)^\mathsf{T}}{\partial \theta} \lambda = 0</math>
and <math>h(\theta) = 0 \;,</math> where <math>~ \lambda = \left[ \lambda_{1}, \lambda_{2}, \ldots, \lambda_{r}\right]^\mathsf{T} ~</math> is a column-vector of [[Lagrange multiplier]]s and <math>\; \frac{\partial h(\theta)^\mathsf{T}}{\partial \theta} \;</math> is the {{mvar|k × r}} [[Jacobian matrix]] of partial derivatives.<ref name="Silvey p79"/> Naturally, if the constraints are not binding at the maximum, the Lagrange multipliers should be zero.<ref>{{cite book |first=Jan R. |last=Magnus |year=2017 |title=Introduction to the Theory of Econometrics |location=Amsterdam |publisher=VU University Press |pages=64–65 |isbn=978-90-8659-766-6}}</ref> This in turn allows for a statistical test of the "validity" of the constraint, known as the [[Lagrange multiplier test]].

=== Nonparametric maximum likelihood estimation ===
Nonparametric maximum likelihood estimation can be performed using the [[empirical likelihood]].
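The empirical likelihood is itself a constrained maximization of the kind described above: probability weights <math>p_i</math> on the observed points are chosen to maximize <math>\textstyle\sum_i \ln p_i</math> subject to <math>\textstyle\sum_i p_i = 1</math> and, for a profile, a moment constraint. The following minimal sketch (an added illustration; the data, the hypothesised mean <code>mu0</code>, and the solver choice are assumptions) profiles the empirical likelihood of a mean with a general-purpose constrained optimizer; with the moment constraint removed, the maximizer is the empirical distribution <math>p_i = 1/n</math>.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=30)    # assumed example data
mu0 = 0.8                                      # hypothesised mean to profile
n = len(y)

# Maximize the nonparametric log-likelihood sum(log p_i) over probability
# weights on the observations, subject to the weights summing to one and
# (for the empirical-likelihood profile) the weighted mean equalling mu0.
res = minimize(
    lambda p: -np.sum(np.log(p)),
    x0=np.full(n, 1.0 / n),                    # empirical distribution as starting point
    method="SLSQP",
    bounds=[(1e-10, 1.0)] * n,
    constraints=[
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
        {"type": "eq", "fun": lambda p: p @ y - mu0},
    ],
)

# Empirical log-likelihood ratio for mu0 (zero when mu0 equals the sample mean).
elr = -res.fun - n * np.log(1.0 / n)
print("weights sum to:", res.x.sum(), " log-likelihood ratio:", elr)
</syntaxhighlight>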