=== Use in optimization ===
Hessian matrices are used in large-scale [[Mathematical optimization|optimization]] problems within [[Newton's method in optimization|Newton]]-type methods because they are the coefficient of the quadratic term of a local [[Taylor expansion]] of a function. That is,
<math display=block>y = f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\mathsf{T} \Delta\mathbf{x} + \frac{1}{2} \, \Delta\mathbf{x}^\mathsf{T} \mathbf{H}(\mathbf{x}) \, \Delta\mathbf{x}</math>
where <math>\nabla f</math> is the [[gradient]] <math>\left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right).</math>

Computing and storing the full Hessian matrix takes [[Big theta|<math>\Theta\left(n^2\right)</math>]] memory, which is infeasible for high-dimensional functions such as the [[loss function]]s of [[Artificial neural network|neural nets]], [[conditional random field]]s, and other [[statistical model]]s with large numbers of parameters. For such situations, [[Truncated Newton method|truncated-Newton]] and [[Quasi-Newton method|quasi-Newton]] algorithms have been developed. The latter family of algorithms uses approximations to the Hessian; one of the most popular quasi-Newton algorithms is [[Broyden–Fletcher–Goldfarb–Shanno algorithm|BFGS]].<ref>{{cite book|last1=Nocedal|first1=Jorge|author-link1=Jorge Nocedal|last2=Wright|first2=Stephen|year=2000|title=Numerical Optimization|isbn=978-0-387-98793-4|publisher=Springer Verlag}}</ref>

Such approximations may use the fact that an optimization algorithm uses the Hessian only as a [[linear operator]] <math>\mathbf{H}(\mathbf{v}),</math> and proceed by first noticing that the Hessian also appears in the local expansion of the gradient:
<math display=block>\nabla f (\mathbf{x} + \Delta\mathbf{x}) = \nabla f (\mathbf{x}) + \mathbf{H}(\mathbf{x}) \, \Delta\mathbf{x} + \mathcal{O}(\|\Delta\mathbf{x}\|^2)</math>
Letting <math>\Delta \mathbf{x} = r \mathbf{v}</math> for some scalar <math>r,</math> this gives
<math display=block>\mathbf{H}(\mathbf{x}) \, \Delta\mathbf{x} = \mathbf{H}(\mathbf{x}) r \mathbf{v} = r \mathbf{H}(\mathbf{x}) \mathbf{v} = \nabla f (\mathbf{x} + r\mathbf{v}) - \nabla f (\mathbf{x}) + \mathcal{O}(r^2),</math>
that is,
<math display=block>\mathbf{H}(\mathbf{x}) \mathbf{v} = \frac{1}{r} \left[\nabla f(\mathbf{x} + r \mathbf{v}) - \nabla f(\mathbf{x})\right] + \mathcal{O}(r)</math>
so if the gradient is already computed, the approximate Hessian–vector product can be computed by a linear (in the size of the gradient) number of scalar operations. (While simple to program, this approximation scheme is not numerically stable, since <math>r</math> must be made small to limit the error from the <math>\mathcal{O}(r)</math> term, but decreasing it loses precision in the first term.<ref>{{cite journal|last=Pearlmutter|first=Barak A.|title=Fast exact multiplication by the Hessian|journal=Neural Computation|volume=6|issue=1|year=1994|url=http://www.bcl.hamilton.ie/~barak/papers/nc-hessian.pdf|doi=10.1162/neco.1994.6.1.147|pages=147–160|s2cid=1251969}}</ref>)

Notably, in the context of randomized search heuristics, the [[evolution strategy]]'s covariance matrix adapts to the inverse of the Hessian matrix, [[up to]] a scalar factor and small random fluctuations. This result has been formally proven for a single-parent strategy and a static model, as the population size increases, relying on the quadratic approximation.<ref>{{cite journal|doi=10.1016/j.tcs.2019.09.002|first=O.M.|last=Shir|author2=A. Yehudayoff|title=On the covariance-Hessian relation in evolution strategies|journal=Theoretical Computer Science|volume=801|pages=157–174|publisher=Elsevier|year=2020|doi-access=free|arxiv=1806.03674}}</ref>
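As an illustration of the finite-difference scheme above, the following short Python sketch (not drawn from the cited references; the quadratic test function and all names are chosen here purely for illustration) approximates the Hessian–vector product <math>\mathbf{H}(\mathbf{x})\mathbf{v}</math> from two gradient evaluations and compares it with the exact product for a quadratic function, whose Hessian is constant:

<syntaxhighlight lang="python">
import numpy as np

def hessian_vector_product(grad, x, v, r=1e-5):
    # Approximate H(x) @ v from two gradient evaluations:
    #   H(x) v ≈ [grad(x + r v) - grad(x)] / r, with O(r) truncation error.
    # r too large -> truncation error dominates; r too small -> cancellation.
    return (grad(x + r * v) - grad(x)) / r

# Illustrative test problem: f(x) = 1/2 x^T A x, whose Hessian is exactly A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x                      # gradient of the quadratic f
x = np.array([1.0, -1.0])                   # evaluation point
v = np.array([0.5, 2.0])                    # direction vector

print(hessian_vector_product(grad, x, v))   # approximately A @ v = [3.5, 4.5]
print(A @ v)                                # exact product, for comparison
</syntaxhighlight>

Only gradient evaluations are needed, so the full Hessian is never formed or stored; this is what makes such matrix–vector schemes attractive when the <math>\Theta\left(n^2\right)</math> memory cost of the full Hessian is infeasible.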