==== Variants ====
The popularity of ''Adam'' inspired many variants and enhancements. Some examples include:
* Nesterov-enhanced gradients: ''NAdam'',<ref>{{Cite journal |last=Dozat |first=T. |date=2016 |title=Incorporating Nesterov Momentum into Adam |s2cid=70293087 |language=en}}</ref> ''FASFA''<ref>{{Cite journal |last=Naveen |first=Philip |date=2022-08-09 |title=FASFA: A Novel Next-Generation Backpropagation Optimizer |url=http://dx.doi.org/10.36227/techrxiv.20427852.v1 |access-date=2022-11-19 |doi=10.36227/techrxiv.20427852.v1}}</ref>
* Varying interpretations of second-order information: ''Powerpropagation''<ref>{{Cite book |last1=Schwarz |first1=Jonathan |last2=Jayakumar |first2=Siddhant M. |last3=Pascanu |first3=Razvan |last4=Latham |first4=Peter E. |last5=Teh |first5=Yee Whye |url=http://worldcat.org/oclc/1333722169 |title=Powerpropagation: A sparsity inducing weight reparameterisation |date=2021-10-01 |oclc=1333722169}}</ref> and ''AdaSqrt''.<ref>{{Cite journal |last1=Hu |first1=Yuzheng |last2=Lin |first2=Licong |last3=Tang |first3=Shange |date=2019-12-20 |title=Second-order Information in First-order Optimization Methods |arxiv=1912.09926}}</ref>
* Using the [[Uniform norm|infinity norm]]: ''AdaMax''<ref name="Adam2014" />
* ''AMSGrad'',<ref>{{Cite journal |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |arxiv=1904.09237}}</ref> which improves convergence over ''Adam'' by using the maximum of past squared gradients instead of their exponential moving average.<ref>{{cite web |url=https://www.ruder.io/optimizing-gradient-descent/#amsgrad |title=An overview of gradient descent optimization algorithms |date=19 January 2016}}</ref> ''AdamX''<ref>{{Cite journal |last1=Tran |first1=Phuong Thi |last2=Phong |first2=Le Trieu |date=2019 |title=On the Convergence Proof of AMSGrad and a New Version |url=https://ieeexplore.ieee.org/document/8713445 |journal=IEEE Access |volume=7 |pages=61706–61716 |doi=10.1109/ACCESS.2019.2916341 |issn=2169-3536 |arxiv=1904.03590 |bibcode=2019IEEEA...761706T}}</ref> further improves convergence over ''AMSGrad''.
* ''AdamW'',<ref name="AdamW">{{cite journal |last1=Loshchilov |first1=Ilya |last2=Hutter |first2=Frank |date=4 January 2019 |title=Decoupled Weight Decay Regularization |arxiv=1711.05101}}</ref> which decouples [[weight decay]] from the gradient-based update (this and the ''AMSGrad'' modification are sketched in the code example below).
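The following is a minimal NumPy sketch, not taken from the cited papers, of how two of these variants modify a plain ''Adam'' step: ''AMSGrad'' replaces the exponential average of squared gradients in the denominator with a running element-wise maximum, and ''AdamW'' applies weight decay directly to the parameters rather than adding it to the gradient. The function and variable names (<code>adam_variant_step</code>, <code>state</code>, and so on) are illustrative, and details such as where bias correction is applied vary across implementations.

<syntaxhighlight lang="python">
import numpy as np

def adam_variant_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, weight_decay=0.0, amsgrad=False):
    """One optimizer step; `state` holds the moments m, v, v_max and step count t."""
    state["t"] += 1
    t = state["t"]

    # Standard Adam first- and second-moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    if amsgrad:
        # AMSGrad: keep the element-wise maximum of past second-moment estimates
        # instead of the exponential average alone. (The original paper applies the
        # maximum without bias correction; this detail varies across implementations.)
        state["v_max"] = np.maximum(state["v_max"], v_hat)
        denom = np.sqrt(state["v_max"]) + eps
    else:
        denom = np.sqrt(v_hat) + eps

    theta = theta - lr * m_hat / denom

    # AdamW-style decoupled weight decay: shrink the weights directly,
    # rather than adding weight_decay * theta to the gradient.
    if weight_decay:
        theta = theta - lr * weight_decay * theta
    return theta

# Usage: minimize f(theta) = ||theta||^2 from a random starting point.
theta = np.random.randn(5)
state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "v_max": np.zeros_like(theta)}
for _ in range(1000):
    grad = 2 * theta  # gradient of ||theta||^2
    theta = adam_variant_step(theta, grad, state, lr=1e-2,
                              amsgrad=True, weight_decay=1e-2)
print(theta)  # entries close to zero
</syntaxhighlight>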