===Second-order methods===
A stochastic analogue of the standard (deterministic) [[Newton's method in optimization|Newton–Raphson algorithm]] (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation{{Citation needed|date=April 2020}}. A method that uses direct measurements of the [[Hessian matrix|Hessian matrices]] of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer.<ref>{{cite journal |first1=R. H. |last1=Byrd |first2=S. L. |last2=Hansen |first3=J. |last3=Nocedal |first4=Y. |last4=Singer |title=A Stochastic Quasi-Newton method for Large-Scale Optimization |journal=SIAM Journal on Optimization |volume=26 |issue=2 |pages=1008–1031 |year=2016 |doi=10.1137/140954362 |arxiv=1401.7020 |s2cid=12396034 }}</ref> However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others.<ref>{{cite journal |first=J. C. |last=Spall |year=2000 |title=Adaptive Stochastic Approximation by the Simultaneous Perturbation Method |journal=IEEE Transactions on Automatic Control |volume=45 |issue=10 |pages=1839–1853 |doi=10.1109/TAC.2000.880982 }}</ref><ref>{{cite journal |first=J. C. |last=Spall |year=2009 |title=Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm |journal=IEEE Transactions on Automatic Control |volume=54 |issue=6 |pages=1216–1229 |doi=10.1109/TAC.2009.2019793 |s2cid=3564529 }}</ref><ref>{{cite book |first1=S. |last1=Bhatnagar |first2=H. L. |last2=Prasad |first3=L. A. |last3=Prashanth |year=2013 |title=Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods |location=London |publisher=Springer |isbn=978-1-4471-4284-3 }}</ref> (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.<ref>{{cite journal |first=D. |last=Ruppert |title=A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure |journal=[[Annals of Statistics]] |volume=13 |issue=1 |pages=236–245 |year=1985 |doi=10.1214/aos/1176346589 |doi-access=free }}</ref>) Another approach to approximating the Hessian matrix is to replace it with the [[Fisher information]] matrix, which transforms the usual gradient into the natural gradient.<ref>{{cite journal |first1=S. |last1=Amari |year=1998 |title=Natural gradient works efficiently in learning |journal=Neural Computation |volume=10 |issue=2 |pages=251–276 |doi=10.1162/089976698300017746 |s2cid=207585383 }}</ref> These methods, which do not require direct Hessian information, are based on either the values of the summands in the above empirical risk function or the values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.

When the objective is a [[Non-linear least squares|nonlinear least-squares]] loss
<math display="block"> Q(w) = \frac{1}{n} \sum_{i=1}^n Q_i(w) = \frac{1}{n} \sum_{i=1}^n (m(w;x_i)-y_i)^2, </math>
where <math>m(w;x_i)</math> is the predictive model (e.g., a [[Neural network (machine learning)|deep neural network]]), the structure of the objective can be exploited to estimate second-order information using gradients alone.
The resulting methods are simple and often effective.<ref>{{cite conference |last1=Brust |first1=J.J. |date=2021 |title=Nonlinear least squares for large-scale machine learning using stochastic Jacobian estimates |conference=ICML 2021 |arxiv=2107.05598 |book-title=Workshop: Beyond First Order Methods in Machine Learning}}</ref>
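As an illustration of how gradients alone can supply second-order information (a standard [[Gauss–Newton algorithm|Gauss–Newton]]-type sketch, using only the symbols defined above), the Hessian of this least-squares loss decomposes as
<math display="block"> \nabla^2 Q(w) = \frac{2}{n} \sum_{i=1}^n \left( \nabla m(w;x_i)\, \nabla m(w;x_i)^\top + (m(w;x_i)-y_i)\, \nabla^2 m(w;x_i) \right), </math>
and dropping the term involving <math>\nabla^2 m</math> yields the approximation
<math display="block"> \nabla^2 Q(w) \approx \frac{2}{n} \sum_{i=1}^n \nabla m(w;x_i)\, \nabla m(w;x_i)^\top, </math>
which can be formed from first derivatives of the model <math>m</math> only.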