===Adam===
''Adam''<ref name="Adam2014">{{cite arXiv |first1=Diederik |last1=Kingma |first2=Jimmy |last2=Ba |eprint=1412.6980 |title=Adam: A Method for Stochastic Optimization |year=2014 |class=cs.LG }}</ref> (short for Adaptive Moment Estimation) is a 2014 update to the ''RMSProp'' optimizer that combines it with the main feature of the ''Momentum'' method.<ref>{{cite web |url=https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html |title=4. Beyond Gradient Descent - Fundamentals of Deep Learning [Book] }}</ref> In this optimization algorithm, running averages with exponential forgetting of both the gradients and the second moments of the gradients are used. Given parameters <math>w^{(t)}</math> and a loss function <math>L^{(t)}</math>, where <math>t</math> indexes the current training iteration (starting at <math>1</math>), Adam's parameter update is given by:
<math display="block">m_w^{(t)} := \beta_1 m_w^{(t-1)} + \left(1 - \beta_1\right) \nabla_w L^{(t-1)}</math>
<math display="block">v_w^{(t)} := \beta_2 v_w^{(t-1)} + \left(1 - \beta_2\right) \left(\nabla_w L^{(t-1)}\right)^2</math>
<math display="block">\hat{m}_w^{(t)} = \frac{m_w^{(t)}}{1 - \beta_1^t}</math>
<math display="block">\hat{v}_w^{(t)} = \frac{v_w^{(t)}}{1 - \beta_2^t}</math>
<math display="block">w^{(t)} := w^{(t-1)} - \eta \frac{\hat{m}_w^{(t)}}{\sqrt{\hat{v}_w^{(t)}} + \varepsilon}</math>
where <math>\varepsilon</math> is a small scalar (e.g. <math>10^{-8}</math>) used to prevent division by zero, and <math>\beta_1</math> (e.g. 0.9) and <math>\beta_2</math> (e.g. 0.999) are the forgetting factors for the gradients and the second moments of the gradients, respectively. Squaring and square-rooting are done element-wise.

Because the exponential moving averages of the gradient <math>m_w^{(t)}</math> and the squared gradient <math>v_w^{(t)}</math> are initialized with a vector of zeros, they are biased towards zero during the first training iterations. The factor <math>\tfrac{1}{1 - \beta_{1/2}^t}</math> compensates for this bias and yields the corrected estimates <math>\hat{m}_w^{(t)}</math> and <math>\hat{v}_w^{(t)}</math>.
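The update rules above translate almost line for line into code. The following is a minimal illustrative sketch in NumPy for a single parameter array; the function name <code>adam_step</code> and the default hyperparameters are chosen here only for exposition and are not taken from the cited reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; w, grad, m, v are arrays of the same shape, t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad            # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad**2         # exponential average of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                    # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # element-wise parameter update
    return w, m, v
</syntaxhighlight>

A training loop would call this once per iteration, keeping <code>m</code> and <code>v</code> as persistent state initialized to zero arrays and starting the counter at <code>t = 1</code>.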
The initial proof establishing the convergence of Adam was incomplete, and subsequent analysis has revealed that Adam does not converge for all convex objectives.<ref>{{cite conference |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |url=https://openreview.net/forum?id=ryQu7f-RZ |conference=6th International Conference on Learning Representations (ICLR 2018) |arxiv=1904.09237}}</ref><ref>{{Cite thesis |last=Rubio |first=David Martínez |title=Convergence Analysis of an Adaptive Method of Gradient Descent |date=2017 |access-date=5 January 2024 |degree=Master |publisher=University of Oxford |url=https://damaru2.github.io/convergence_analysis_hypergradient_descent/dissertation_hypergradients.pdf}}</ref> Despite this, ''Adam'' continues to be widely used because of its strong performance in practice.<ref>{{cite conference |last1=Zhang |first1=Yushun |last2=Chen |first2=Congliang |last3=Shi |first3=Naichen |last4=Sun |first4=Ruoyu |last5=Luo |first5=Zhi-Quan |date=2022 |title=Adam Can Converge Without Any Modification On Update Rules |conference=Advances in Neural Information Processing Systems 35 (NeurIPS 2022) |arxiv=2208.09632 |book-title=Advances in Neural Information Processing Systems 35}}</ref>

====Variants====
The popularity of ''Adam'' inspired many variants and enhancements. Some examples include:
* Nesterov-enhanced gradients: ''NAdam''<ref>{{Cite journal |last=Dozat |first=T. |date=2016 |title=Incorporating Nesterov Momentum into Adam |s2cid=70293087 |language=en}}</ref> and ''FASFA''<ref>{{Cite journal |last=Naveen |first=Philip |date=2022-08-09 |title=FASFA: A Novel Next-Generation Backpropagation Optimizer |url=http://dx.doi.org/10.36227/techrxiv.20427852.v1 |access-date=2022-11-19 |doi=10.36227/techrxiv.20427852.v1 }}</ref>
* Varying interpretations of second-order information: ''Powerpropagation''<ref>{{Cite book |last1=Schwarz |first1=Jonathan |last2=Jayakumar |first2=Siddhant M. |last3=Pascanu |first3=Razvan |last4=Latham |first4=Peter E. |last5=Teh |first5=Yee Whye |url=http://worldcat.org/oclc/1333722169 |title=Powerpropagation: A sparsity inducing weight reparameterisation |date=2021-10-01 |oclc=1333722169}}</ref> and ''AdaSqrt''<ref>{{Cite journal |last1=Hu |first1=Yuzheng |last2=Lin |first2=Licong |last3=Tang |first3=Shange |date=2019-12-20 |title=Second-order Information in First-order Optimization Methods |arxiv=1912.09926 }}</ref>
* Using the [[Uniform norm|infinity norm]]: ''AdaMax''<ref name="Adam2014" />
* ''AMSGrad'',<ref>{{Cite journal |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |arxiv=1904.09237 }}</ref> which improves convergence over ''Adam'' by using the maximum of past squared gradients instead of their exponential average (see the sketch following this list).<ref>{{cite web |url=https://www.ruder.io/optimizing-gradient-descent/#amsgrad |title=An overview of gradient descent optimization algorithms |date=19 January 2016 }}</ref> ''AdamX''<ref>{{Cite journal |last1=Tran |first1=Phuong Thi |last2=Phong |first2=Le Trieu |date=2019 |title=On the Convergence Proof of AMSGrad and a New Version |url=https://ieeexplore.ieee.org/document/8713445 |journal=IEEE Access |volume=7 |pages=61706–61716 |doi=10.1109/ACCESS.2019.2916341 |issn=2169-3536 |arxiv=1904.03590 |bibcode=2019IEEEA...761706T }}</ref> further improves convergence over ''AMSGrad''.
* ''AdamW'',<ref name="AdamW">{{cite journal |last1=Loshchilov |first1=Ilya |last2=Hutter |first2=Frank |date=4 January 2019 |title=Decoupled Weight Decay Regularization |arxiv=1711.05101}}</ref> which decouples the [[weight decay]] from the gradient-based update (also shown in the sketch below).
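To make the last two variants concrete, the following sketch modifies the <code>adam_step</code> function above. It is illustrative only: the function names are chosen here for exposition, ''AMSGrad'' is given in one common formulation (the treatment of bias correction varies between presentations), and ''AdamW'' follows the decoupled weight-decay idea of Loshchilov and Hutter with an assumed decay coefficient.

<syntaxhighlight lang="python">
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Like Adam, but the denominator uses the running maximum of the second moment."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)                   # never let the denominator shrink
    m_hat = m / (1 - beta1**t)
    w = w - eta * m_hat / (np.sqrt(v_max) + eps)
    return w, m, v, v_max

def adamw_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Like Adam, but weight decay is applied to w directly rather than folded into the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
</syntaxhighlight>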