==== Variants ====
The popularity of ''Adam'' inspired many variants and enhancements. Some examples include:
* Nesterov-enhanced gradients: ''NAdam'',<ref>{{Cite journal |last=Dozat |first=T. |date=2016 |title=Incorporating Nesterov Momentum into Adam |s2cid=70293087 |language=en}}</ref> ''FASFA''<ref>{{Cite journal |last=Naveen |first=Philip |date=2022-08-09 |title=FASFA: A Novel Next-Generation Backpropagation Optimizer |url=http://dx.doi.org/10.36227/techrxiv.20427852.v1 |access-date=2022-11-19 |doi=10.36227/techrxiv.20427852.v1}}</ref>
* Varying interpretations of second-order information: ''Powerpropagation''<ref>{{Cite book |last1=Schwarz |first1=Jonathan |last2=Jayakumar |first2=Siddhant M. |last3=Pascanu |first3=Razvan |last4=Latham |first4=Peter E. |last5=Teh |first5=Yee Whye |url=http://worldcat.org/oclc/1333722169 |title=Powerpropagation: A sparsity inducing weight reparameterisation |date=2021-10-01 |oclc=1333722169}}</ref> and ''AdaSqrt''.<ref>{{Cite journal |last1=Hu |first1=Yuzheng |last2=Lin |first2=Licong |last3=Tang |first3=Shange |date=2019-12-20 |title=Second-order Information in First-order Optimization Methods |arxiv=1912.09926}}</ref>
* Using the [[Uniform norm|infinity norm]]: ''AdaMax''<ref name="Adam2014" />
* ''AMSGrad'',<ref>{{Cite journal |last1=Reddi |first1=Sashank J. |last2=Kale |first2=Satyen |last3=Kumar |first3=Sanjiv |date=2018 |title=On the Convergence of Adam and Beyond |arxiv=1904.09237}}</ref> which improves convergence over ''Adam'' by using the maximum of past squared gradients instead of their exponential moving average.<ref>{{cite web |url=https://www.ruder.io/optimizing-gradient-descent/#amsgrad |title=An overview of gradient descent optimization algorithms |date=19 January 2016}}</ref> ''AdamX''<ref>{{Cite journal |last1=Tran |first1=Phuong Thi |last2=Phong |first2=Le Trieu |date=2019 |title=On the Convergence Proof of AMSGrad and a New Version |url=https://ieeexplore.ieee.org/document/8713445 |journal=IEEE Access |volume=7 |pages=61706–61716 |doi=10.1109/ACCESS.2019.2916341 |issn=2169-3536 |arxiv=1904.03590 |bibcode=2019IEEEA...761706T}}</ref> further improves convergence over ''AMSGrad''.
* ''AdamW'',<ref name="AdamW">{{cite journal |last1=Loshchilov |first1=Ilya |last2=Hutter |first2=Frank |date=4 January 2019 |title=Decoupled Weight Decay Regularization |arxiv=1711.05101}}</ref> which decouples [[weight decay]] from the gradient-based update (this and the ''AMSGrad'' modification are sketched in the code example below).
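The following is a minimal NumPy sketch, not taken from the cited papers, of how two of these variants modify a plain ''Adam'' step: ''AMSGrad'' replaces the exponential average of squared gradients in the denominator with a running element-wise maximum, and ''AdamW'' applies weight decay directly to the parameters rather than adding it to the gradient. The function and variable names (<code>adam_variant_step</code>, <code>state</code>, and so on) are illustrative, and details such as where bias correction is applied vary across implementations.

<syntaxhighlight lang="python">
import numpy as np

def adam_variant_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, weight_decay=0.0, amsgrad=False):
    """One optimizer step; `state` holds the moments m, v, v_max and step count t."""
    state["t"] += 1
    t = state["t"]

    # Standard Adam first- and second-moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    if amsgrad:
        # AMSGrad: keep the element-wise maximum of past second-moment estimates
        # instead of the exponential average alone. (The original paper applies the
        # maximum without bias correction; this detail varies across implementations.)
        state["v_max"] = np.maximum(state["v_max"], v_hat)
        denom = np.sqrt(state["v_max"]) + eps
    else:
        denom = np.sqrt(v_hat) + eps

    theta = theta - lr * m_hat / denom

    # AdamW-style decoupled weight decay: shrink the weights directly,
    # rather than adding weight_decay * theta to the gradient.
    if weight_decay:
        theta = theta - lr * weight_decay * theta
    return theta

# Usage: minimize f(theta) = ||theta||^2 from a random starting point.
theta = np.random.randn(5)
state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "v_max": np.zeros_like(theta)}
for _ in range(1000):
    grad = 2 * theta  # gradient of ||theta||^2
    theta = adam_variant_step(theta, grad, state, lr=1e-2,
                              amsgrad=True, weight_decay=1e-2)
print(theta)  # entries close to zero
</syntaxhighlight>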