== Iterative procedures ==
Except for special cases, the likelihood equations
<math display="block">\frac{\partial \ell(\theta;\mathbf{y})}{\partial \theta} = 0</math>
cannot be solved explicitly for an estimator <math>\widehat{\theta} = \widehat{\theta}(\mathbf{y})</math>. Instead, they need to be solved [[Iterative method|iteratively]]: starting from an initial guess of <math>\theta</math> (say <math>\widehat{\theta}_{1}</math>), one seeks to obtain a convergent sequence <math>\left\{ \widehat{\theta}_{r} \right\}</math>. Many methods for this kind of [[optimization problem]] are available,<ref>{{cite book |first=R. |last=Fletcher |year=1987 |title=Practical Methods of Optimization |location=New York, NY |publisher=John Wiley & Sons |edition=Second |isbn=0-471-91547-5 |url=https://archive.org/details/practicalmethods0000flet |url-access=registration }}</ref><ref>{{cite book |first1=Jorge |last1=Nocedal |author-link=Jorge Nocedal |first2=Stephen J. |last2=Wright |year=2006 |title=Numerical Optimization |location=New York, NY |publisher=Springer |edition=Second |isbn=0-387-30303-0 }}</ref> but the most commonly used ones are algorithms based on an updating formula of the form
<math display="block">\widehat{\theta}_{r+1} = \widehat{\theta}_{r} + \eta_{r} \mathbf{d}_r\left(\widehat{\theta}\right)</math>
where the vector <math>\mathbf{d}_{r}\left(\widehat{\theta}\right)</math> indicates the [[descent direction]] of the <var>r</var>th "step," and the scalar <math>\eta_{r}</math> captures the "step length,"<ref>{{cite book |first=Carlos |last=Daganzo |title=Multinomial Probit: The Theory and its Application to Demand Forecasting |location=New York |publisher=Academic Press |year=1979 |isbn=0-12-201150-3 |pages=61–78 }}</ref><ref>{{cite book |first1=William |last1=Gould |first2=Jeffrey |last2=Pitblado |first3=Brian |last3=Poi |title=Maximum Likelihood Estimation with Stata |location=College Station |publisher=Stata Press |year=2010 |edition=Fourth |isbn=978-1-59718-078-8 |pages=13–20 }}</ref> also known as the [[learning rate]].<ref>{{cite book |first=Kevin P. |last=Murphy |title=Machine Learning: A Probabilistic Perspective |location=Cambridge |publisher=MIT Press |year=2012 |isbn=978-0-262-01802-9 |page=247 |url=https://books.google.com/books?id=NZP6AQAAQBAJ&pg=PA247 }}</ref>

=== [[Gradient descent]] method ===
(Note: here the problem is a maximization, so the sign in front of the gradient is flipped relative to the usual descent update.) Gradient descent uses a step length <math>\eta_r\in \R^+</math> that is small enough for convergence and the direction
<math display="block">\mathbf{d}_r\left(\widehat{\theta}\right) = \nabla\ell\left(\widehat{\theta}_r;\mathbf{y}\right).</math>
The gradient descent method requires the gradient to be calculated at the <var>r</var>th iteration, but it does not require the inverse of the second-order derivative, i.e., the Hessian matrix. It is therefore computationally faster per iteration than the Newton–Raphson method.
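As a minimal illustration (not taken from the cited references), the sketch below applies gradient ascent with a fixed step length to the log-likelihood of an exponential model with rate <math>\lambda</math>, for which <math>\ell(\lambda;\mathbf{y}) = n\log\lambda - \lambda \textstyle\sum_i y_i</math> and the score is <math>n/\lambda - \textstyle\sum_i y_i</math>; the simulated data, step length, and stopping tolerance are illustrative assumptions rather than prescribed choices.

<syntaxhighlight lang="python">
import numpy as np

# Sketch under illustrative assumptions: gradient ascent on the log-likelihood
# of an exponential model with rate lambda, l(lambda; y) = n*log(lambda) - lambda*sum(y).
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)   # simulated data, true rate = 0.5

def score(lam, y):
    """Gradient of the log-likelihood with respect to lambda."""
    return y.size / lam - y.sum()

lam = 1.0        # initial guess, theta_hat_1
eta = 1e-4       # fixed step length (learning rate), chosen small for convergence
for r in range(10_000):
    step = eta * score(lam, y)
    lam += step                    # maximization: move along the gradient
    if abs(step) < 1e-10:          # stop once the updates become negligible
        break

print(lam, 1.0 / y.mean())         # iterative estimate vs. closed form n / sum(y)
</syntaxhighlight>

Because this model has the closed-form estimate <math>\widehat{\lambda} = n / \textstyle\sum_i y_i</math>, the last line compares the limit of the iterates against it; in models where no closed form exists, only the iterative sequence is available.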
=== [[Newton's method|Newton–Raphson method]] ===
The Newton–Raphson method uses <math>\eta_r = 1</math> and the direction
<math display="block">\mathbf{d}_r\left(\widehat{\theta}\right) = -\mathbf{H}^{-1}_r\left(\widehat{\theta}\right) \mathbf{s}_r\left(\widehat{\theta}\right),</math>
where <math>\mathbf{s}_{r}(\widehat{\theta})</math> is the [[Score (statistics)|score]] and <math>\mathbf{H}^{-1}_r \left(\widehat{\theta}\right)</math> is the [[Invertible matrix|inverse]] of the [[Hessian matrix]] of the log-likelihood function, both evaluated at the <var>r</var>th iteration.<ref>{{cite book |first=Takeshi |last=Amemiya |author-link=Takeshi Amemiya |title=Advanced Econometrics |location=Cambridge |publisher=Harvard University Press |year=1985 |isbn=0-674-00560-0 |pages=[https://archive.org/details/advancedeconomet00amem/page/137 137–138] |url=https://archive.org/details/advancedeconomet00amem/page/137 }}</ref><ref>{{cite book |first=Denis |last=Sargan |author-link=Denis Sargan |chapter=Methods of Numerical Optimization |title=Lecture Notes on Advanced Econometric Theory |location=Oxford |publisher=Basil Blackwell |year=1988 |isbn=0-631-14956-2 |pages=161–169 }}</ref> But because the calculation of the Hessian matrix is [[Computational complexity|computationally costly]], numerous alternatives have been proposed. The popular [[Berndt–Hall–Hall–Hausman algorithm]] approximates the Hessian with the average [[outer product]] of the gradient, such that
<math display="block">\mathbf{d}_r\left(\widehat{\theta}\right) = - \left[ \frac{1}{n} \sum_{t=1}^n \frac{\partial \ell(\theta;\mathbf{y})}{\partial \theta} \left( \frac{\partial \ell(\theta;\mathbf{y})}{\partial \theta} \right)^{\mathsf{T}} \right]^{-1} \mathbf{s}_r \left(\widehat{\theta}\right)</math>

=== [[Quasi-Newton method]]s ===
Other quasi-Newton methods use more elaborate secant updates to approximate the Hessian matrix.

==== [[Davidon–Fletcher–Powell formula]] ====
The DFP formula finds a solution that is symmetric, positive-definite and closest to the current approximate value of the second-order derivative:
<math display="block">\mathbf{H}_{k+1} = \left(I - \gamma_k y_k s_k^\mathsf{T}\right) \mathbf{H}_k \left(I - \gamma_k s_k y_k^\mathsf{T}\right) + \gamma_k y_k y_k^\mathsf{T},</math>
where
<math display="block">y_k = \nabla\ell(x_k + s_k) - \nabla\ell(x_k),</math>
<math display="block">\gamma_k = \frac{1}{y_k^\mathsf{T} s_k},</math>
<math display="block">s_k = x_{k+1} - x_k.</math>

==== [[Broyden–Fletcher–Goldfarb–Shanno algorithm]] ====
BFGS also gives a solution that is symmetric and positive-definite:
<math display="block">B_{k+1} = B_k + \frac{y_k y_k^\mathsf{T}}{y_k^\mathsf{T} s_k} - \frac{B_k s_k s_k^\mathsf{T} B_k^\mathsf{T}}{s_k^\mathsf{T} B_k s_k},</math>
where
<math display="block">y_k = \nabla\ell(x_k + s_k) - \nabla\ell(x_k),</math>
<math display="block">s_k = x_{k+1} - x_k.</math>
The BFGS method is not guaranteed to converge unless the function has a quadratic [[Taylor expansion]] near an optimum. However, BFGS can have acceptable performance even for non-smooth optimization instances.

==== [[Scoring algorithm|Fisher's scoring]] ====
Another popular method is to replace the Hessian with the [[Fisher information matrix]], <math>\mathcal{I}(\theta) = -\operatorname{\mathbb E}\left[\mathbf{H}_r \left(\widehat{\theta}\right)\right]</math>, giving us the Fisher scoring algorithm. This procedure is standard in the estimation of many methods, such as [[generalized linear models]].
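In practice the secant updates above are rarely hand-coded. As a hedged sketch (the model, simulated data, starting value, and log-scale parameterization are illustrative assumptions), the same exponential-model likelihood can be maximized with an off-the-shelf quasi-Newton routine by minimizing the negative log-likelihood, here using SciPy's BFGS implementation.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

# Sketch under illustrative assumptions: quasi-Newton (BFGS) maximization of an
# exponential-model log-likelihood, done by minimizing its negative.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)      # simulated data, true rate = 0.5

def neg_loglik(params):
    lam = np.exp(params[0])                    # log scale keeps the rate positive
    return -(y.size * np.log(lam) - lam * y.sum())

res = minimize(neg_loglik, x0=np.array([0.0]), method="BFGS")
print(np.exp(res.x[0]), 1.0 / y.mean())        # BFGS estimate vs. closed form
</syntaxhighlight>

Working on the log scale is a common device for keeping a positivity constraint out of the optimizer; a gradient (<code>jac</code>) argument could also be supplied, otherwise the routine approximates it by finite differences.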
Although popular, quasi-Newton methods may converge to a [[stationary point]] that is not necessarily a local or global maximum,<ref>See theorem 10.1 in {{cite book |first=Mordecai |last=Avriel |year=1976 |title=Nonlinear Programming: Analysis and Methods |pages=293–294 |location=Englewood Cliffs, NJ |publisher=Prentice-Hall |isbn=978-0-486-43227-4 |url=https://books.google.com/books?id=byF4Xb1QbvMC&pg=PA293 }} </ref> but rather a local minimum or a [[saddle point]]. Therefore, it is important to assess the validity of the obtained solution to the likelihood equations, by verifying that the Hessian, evaluated at the solution, is both [[negative definite]] and [[well-conditioned]].<ref> {{cite book |first1=Philip E. |last1=Gill |first2=Walter |last2=Murray |first3=Margaret H. |last3=Wright |author-link3=Margaret H. Wright |year=1981 |title=Practical Optimization |location=London, UK |publisher=Academic Press |pages=[https://archive.org/details/practicaloptimiz00gill/page/n329 312]–313 |isbn=0-12-283950-1 |url=https://archive.org/details/practicaloptimiz00gill |url-access=limited }} </ref>
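A minimal sketch of such a check, under illustrative assumptions (a normal model with unknown mean and log standard deviation, simulated data, and a simple central finite-difference Hessian of the log-likelihood), is given below: all eigenvalues of the Hessian at the estimate should be negative (negative definiteness), and the ratio of the largest to the smallest eigenvalue magnitude gives a rough indication of conditioning.

<syntaxhighlight lang="python">
import numpy as np

# Sketch under illustrative assumptions: check that the Hessian of the
# log-likelihood at the estimate is negative definite and well-conditioned.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=500)   # simulated data

def loglik(theta):
    """Normal log-likelihood (up to an additive constant) in (mu, log sigma)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -0.5 * np.sum(((y - mu) / sigma) ** 2) - y.size * np.log(sigma)

theta_hat = np.array([y.mean(), np.log(y.std())])   # closed-form MLE for this model

def numerical_hessian(f, x, h=1e-5):
    """Central finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * h * h)
    return H

H = numerical_hessian(loglik, theta_hat)
eigvals = np.linalg.eigvalsh(H)
print("negative definite:", bool(np.all(eigvals < 0)))
print("condition number:", np.abs(eigvals).max() / np.abs(eigvals).min())
</syntaxhighlight>

An analytic or automatically differentiated Hessian would normally be preferred; the finite-difference version here is only meant to show what is being verified.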