===Second-order methods===
A stochastic analogue of the standard (deterministic) [[Newton's method in optimization|Newton–Raphson algorithm]] (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation{{Citation needed|date=April 2020}}. A method that uses direct measurements of the [[Hessian matrix|Hessian matrices]] of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer.<ref>{{cite journal |first1=R. H. |last1=Byrd |first2=S. L. |last2=Hansen |first3=J. |last3=Nocedal |first4=Y. |last4=Singer |title=A Stochastic Quasi-Newton method for Large-Scale Optimization |journal=SIAM Journal on Optimization |volume=26 |issue=2 |pages=1008–1031 |year=2016 |doi=10.1137/140954362 |arxiv=1401.7020 |s2cid=12396034 }}</ref> However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others.<ref>{{cite journal |first=J. C. |last=Spall |year=2000 |title=Adaptive Stochastic Approximation by the Simultaneous Perturbation Method |journal=IEEE Transactions on Automatic Control |volume=45 |issue=10 |pages=1839–1853 |doi=10.1109/TAC.2000.880982 }}</ref><ref>{{cite journal |first=J. C. |last=Spall |year=2009 |title=Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm |journal=IEEE Transactions on Automatic Control |volume=54 |issue=6 |pages=1216–1229 |doi=10.1109/TAC.2009.2019793 |s2cid=3564529 }}</ref><ref>{{cite book |first1=S. |last1=Bhatnagar |first2=H. L. |last2=Prasad |first3=L. A. |last3=Prashanth |year=2013 |title=Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods |location=London |publisher=Springer |isbn=978-1-4471-4284-3 }}</ref> (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.<ref>{{cite journal |first=D. |last=Ruppert |title=A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure |journal=[[Annals of Statistics]] |volume=13 |issue=1 |pages=236–245 |year=1985 |doi=10.1214/aos/1176346589 |doi-access=free }}</ref>) Another approach to approximating the Hessian matrix is to replace it with the [[Fisher information]] matrix, which transforms the usual gradient into the natural gradient.<ref>{{cite journal |first1=S. |last1=Amari |year=1998 |title=Natural gradient works efficiently in learning |journal=Neural Computation |volume=10 |issue=2 |pages=251–276 |doi=10.1162/089976698300017746 |s2cid=207585383 }}</ref> These methods, which do not require direct Hessian information, are based on either the values of the summands in the above empirical risk function or the values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.

When the objective is a [[Non-linear least squares|nonlinear least-squares]] loss
<math display="block"> Q(w) = \frac{1}{n} \sum_{i=1}^n Q_i(w) = \frac{1}{n} \sum_{i=1}^n (m(w;x_i)-y_i)^2, </math>
where <math>m(w;x_i)</math> is the predictive model (e.g., a [[Neural network (machine learning)|deep neural network]]), the structure of the objective can be exploited to estimate second-order information using gradients alone.
The resulting methods are simple and often effective.<ref>{{cite conference |last1=Brust |first1=J.J. |date=2021 |title=Nonlinear least squares for large-scale machine learning using stochastic Jacobian estimates |conference=ICML 2021 |arxiv=2107.05598 |book-title=Workshop: Beyond First Order Methods in Machine Learning}}</ref>
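As an illustration of how gradients alone can supply second-order information (a standard [[Gauss–Newton algorithm|Gauss–Newton]]-type sketch, using only the symbols defined above), the Hessian of this least-squares loss decomposes as
<math display="block"> \nabla^2 Q(w) = \frac{2}{n} \sum_{i=1}^n \left( \nabla m(w;x_i)\, \nabla m(w;x_i)^\top + (m(w;x_i)-y_i)\, \nabla^2 m(w;x_i) \right), </math>
and dropping the term involving <math>\nabla^2 m</math> yields the approximation
<math display="block"> \nabla^2 Q(w) \approx \frac{2}{n} \sum_{i=1}^n \nabla m(w;x_i)\, \nabla m(w;x_i)^\top, </math>
which can be formed from first derivatives of the model <math>m</math> only.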