===Adam===
''Adam''<ref name="Adam2014">{{cite arXiv |first1=Diederik |last1=Kingma |first2=Jimmy |last2=Ba |eprint=1412.6980 |title=Adam: A Method for Stochastic Optimization |year=2014 |class=cs.LG }}</ref> (short for Adaptive Moment Estimation) is a 2014 update to the ''RMSProp'' optimizer that combines it with the main feature of the ''Momentum'' method.<ref>{{cite web |url=https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html |title=4. Beyond Gradient Descent - Fundamentals of Deep Learning [Book] }}</ref> In this optimization algorithm, running averages with exponential forgetting of both the gradients and the second moments of the gradients are used. Given parameters <math>w^{(t)}</math> and a loss function <math>L^{(t)}</math>, where <math>t</math> indexes the current training iteration (starting at <math>1</math>), Adam's parameter update is given by:
<math display="block">m_w^{(t)} := \beta_1 m_w^{(t-1)} + \left(1 - \beta_1\right) \nabla_w L^{(t-1)}</math>
<math display="block">v_w^{(t)} := \beta_2 v_w^{(t-1)} + \left(1 - \beta_2\right) \left(\nabla_w L^{(t-1)}\right)^2</math>
<math display="block">\hat{m}_w^{(t)} = \frac{m_w^{(t)}}{1 - \beta_1^t}</math>
<math display="block">\hat{v}_w^{(t)} = \frac{v_w^{(t)}}{1 - \beta_2^t}</math>
<math display="block">w^{(t)} := w^{(t-1)} - \eta \frac{\hat{m}_w^{(t)}}{\sqrt{\hat{v}_w^{(t)}} + \varepsilon}</math>
where <math>\varepsilon</math> is a small scalar (e.g. <math>10^{-8}</math>) used to prevent division by zero, and <math>\beta_1</math> (e.g. 0.9) and <math>\beta_2</math> (e.g. 0.999) are the forgetting factors for the gradients and the second moments of the gradients, respectively. Squaring and square-rooting are done element-wise.

Because the exponential moving averages of the gradient <math>m_w^{(t)}</math> and the squared gradient <math>v_w^{(t)}</math> are initialized with a vector of zeros, they are biased towards zero during the first training iterations. The factor <math>\tfrac{1}{1 - \beta_{1/2}^t}</math> compensates for this bias and yields the corrected estimates <math>\hat{m}_w^{(t)}</math> and <math>\hat{v}_w^{(t)}</math>.
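The update rules above translate almost line for line into code. The following is a minimal illustrative sketch in NumPy for a single parameter array; the function name <code>adam_step</code> and the default hyperparameters are chosen here only for exposition and are not taken from the cited reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; w, grad, m, v are arrays of the same shape, t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad            # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad**2         # exponential average of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                    # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # element-wise parameter update
    return w, m, v
</syntaxhighlight>

A training loop would call this once per iteration, keeping <code>m</code> and <code>v</code> as persistent state initialized to zero arrays and starting the counter at <code>t = 1</code>.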
The initial proof establishing the convergence of Adam was incomplete, and subsequent analysis has revealed that Adam does not converge for all convex objectives.<ref>{{cite conference |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |url=https://openreview.net/forum?id=ryQu7f-RZ |conference=6th International Conference on Learning Representations (ICLR 2018) |arxiv=1904.09237}}</ref><ref>{{Cite thesis |last=Rubio |first=David Martínez |title=Convergence Analysis of an Adaptive Method of Gradient Descent |date=2017 |access-date=5 January 2024 |degree=Master |publisher=University of Oxford |url=https://damaru2.github.io/convergence_analysis_hypergradient_descent/dissertation_hypergradients.pdf}}</ref> Despite this, ''Adam'' continues to be widely used because of its strong performance in practice.<ref>{{cite conference |last1=Zhang |first1=Yushun |last2=Chen |first2=Congliang |last3=Shi |first3=Naichen |last4=Sun |first4=Ruoyu |last5=Luo |first5=Zhi-Quan |date=2022 |title=Adam Can Converge Without Any Modification On Update Rules |conference=Advances in Neural Information Processing Systems 35 (NeurIPS 2022) |arxiv=2208.09632 |book-title=Advances in Neural Information Processing Systems 35}}</ref>

====Variants====
The popularity of ''Adam'' inspired many variants and enhancements. Some examples include:
* Nesterov-enhanced gradients: ''NAdam''<ref>{{Cite journal |last=Dozat |first=T. |date=2016 |title=Incorporating Nesterov Momentum into Adam |s2cid=70293087 |language=en}}</ref> and ''FASFA''<ref>{{Cite journal |last=Naveen |first=Philip |date=2022-08-09 |title=FASFA: A Novel Next-Generation Backpropagation Optimizer |url=http://dx.doi.org/10.36227/techrxiv.20427852.v1 |access-date=2022-11-19 |doi=10.36227/techrxiv.20427852.v1 }}</ref>
* Varying interpretations of second-order information: ''Powerpropagation''<ref>{{Cite book |last1=Schwarz |first1=Jonathan |last2=Jayakumar |first2=Siddhant M. |last3=Pascanu |first3=Razvan |last4=Latham |first4=Peter E. |last5=Teh |first5=Yee Whye |url=http://worldcat.org/oclc/1333722169 |title=Powerpropagation: A sparsity inducing weight reparameterisation |date=2021-10-01 |oclc=1333722169}}</ref> and ''AdaSqrt''<ref>{{Cite journal |last1=Hu |first1=Yuzheng |last2=Lin |first2=Licong |last3=Tang |first3=Shange |date=2019-12-20 |title=Second-order Information in First-order Optimization Methods |arxiv=1912.09926 }}</ref>
* Using the [[Uniform norm|infinity norm]]: ''AdaMax''<ref name="Adam2014" />
* ''AMSGrad'',<ref>{{Cite journal |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |arxiv=1904.09237 }}</ref> which improves convergence over ''Adam'' by using the maximum of past squared gradients instead of their exponential average (see the sketch following this list).<ref>{{cite web |url=https://www.ruder.io/optimizing-gradient-descent/#amsgrad |title=An overview of gradient descent optimization algorithms |date=19 January 2016 }}</ref> ''AdamX''<ref>{{Cite journal |last1=Tran |first1=Phuong Thi |last2=Phong |first2=Le Trieu |date=2019 |title=On the Convergence Proof of AMSGrad and a New Version |url=https://ieeexplore.ieee.org/document/8713445 |journal=IEEE Access |volume=7 |pages=61706–61716 |doi=10.1109/ACCESS.2019.2916341 |issn=2169-3536 |arxiv=1904.03590 |bibcode=2019IEEEA...761706T }}</ref> further improves convergence over ''AMSGrad''.
* ''AdamW'',<ref name="AdamW">{{cite journal |last1=Loshchilov |first1=Ilya |last2=Hutter |first2=Frank |date=4 January 2019 |title=Decoupled Weight Decay Regularization |arxiv=1711.05101}}</ref> which decouples the [[weight decay]] from the gradient-based update (also shown in the sketch below).
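To make the last two variants concrete, the following sketch modifies the <code>adam_step</code> function above. It is illustrative only: the function names are chosen here for exposition, ''AMSGrad'' is given in one common formulation (the treatment of bias correction varies between presentations), and ''AdamW'' follows the decoupled weight-decay idea of Loshchilov and Hutter with an assumed decay coefficient.

<syntaxhighlight lang="python">
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Like Adam, but the denominator uses the running maximum of the second moment."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)                   # never let the denominator shrink
    m_hat = m / (1 - beta1**t)
    w = w - eta * m_hat / (np.sqrt(v_max) + eps)
    return w, m, v, v_max

def adamw_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Like Adam, but weight decay is applied to w directly rather than folded into the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
</syntaxhighlight>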