===Momentum===
{{Anchor|Momentum|Nesterov}}
Further proposals include the ''momentum method'' or the ''heavy ball method'', which in the machine-learning context appeared in [[David Rumelhart|Rumelhart]], [[Geoffrey Hinton|Hinton]] and [[Ronald J. Williams|Williams]]' paper on backpropagation learning<ref name="Rumelhart1986">{{cite journal|last=Rumelhart|first=David E.|author2=Hinton, Geoffrey E.|author3=Williams, Ronald J.|title=Learning representations by back-propagating errors|journal=Nature|date=8 October 1986|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0|bibcode=1986Natur.323..533R|s2cid=205001834}}</ref> and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations.<ref>{{cite web | url=https://boostedml.com/2020/07/gradient-descent-and-momentum-the-heavy-ball-method.html | title=Gradient Descent and Momentum: The Heavy Ball Method | date=13 July 2020 }}</ref> Stochastic gradient descent with momentum remembers the update {{math|Δ''w''}} at each iteration, and determines the next update as a [[linear combination]] of the gradient and the previous update:<ref name="Sutskever2013">{{cite conference|last=Sutskever|first=Ilya|author2=Martens, James|author3=Dahl, George|author4=Hinton, Geoffrey E.|editor=Sanjoy Dasgupta and David Mcallester|title=On the importance of initialization and momentum in deep learning|conference=In Proceedings of the 30th international conference on machine learning (ICML-13)|date=June 2013|volume=28|location=Atlanta, GA|pages=1139–1147|url=http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf|access-date=14 January 2016}}</ref><ref name="SutskeverPhD">{{cite thesis|last=Sutskever|first=Ilya|title=Training recurrent neural networks|date=2013|publisher=University of Toronto|url=http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf|type=Ph.D.|page=74}}</ref>
<math display="block">\Delta w := \alpha \Delta w - \eta\, \nabla Q_i(w)</math>
<math display="block">w := w + \Delta w </math>
which leads to:
<math display="block">w := w - \eta\, \nabla Q_i(w) + \alpha \Delta w </math>
where the [[parametric statistics|parameter]] <math>w</math> which minimizes <math>Q(w)</math> is to be [[estimator|estimated]], <math>\eta</math> is a step size (sometimes called the ''[[learning rate]]'' in machine learning) and <math>\alpha</math> is an exponential [[Learning rate#Learning rate schedule|decay factor]] between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.
The name momentum stems from an analogy to [[momentum]] in physics: the weight vector <math>w</math>, thought of as a particle traveling through parameter space,{{r|Rumelhart1986}} incurs acceleration from the gradient of the loss ("[[force]]"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, damping oscillations. Momentum has been used successfully by computer scientists in the training of [[artificial neural networks]] for several decades.<ref name="Zeiler 2012">{{cite arXiv |last=Zeiler |first=Matthew D. |eprint=1212.5701 |title=ADADELTA: An adaptive learning rate method |year=2012|class=cs.LG }}</ref> The ''momentum method'' is closely related to [[Langevin dynamics|underdamped Langevin dynamics]], and may be combined with [[simulated annealing]].<ref name="Borysenko2021">{{cite journal|last=Borysenko|first=Oleksandr|author2=Byshkin, Maksym|title=CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing|journal=Scientific Reports|date=2021|volume=11|issue=1|pages=10705|doi=10.1038/s41598-021-90144-3|pmid=34021212|pmc=8139967|arxiv=2005.14605|bibcode=2021NatSR..1110705B}}</ref> In the mid-1980s the method was modified by [[Yurii Nesterov]] to use the gradient predicted at the next point, and the resulting method, known as ''Nesterov Accelerated Gradient'', saw some use in machine learning in the 2010s.<ref>{{cite web | url=https://paperswithcode.com/method/nesterov-accelerated-gradient | title=Papers with Code - Nesterov Accelerated Gradient Explained }}</ref>
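The update rule above can be sketched in a few lines of Python. In this sketch the per-example gradient function <code>grad_Qi</code>, the dataset <code>data</code>, and the hyperparameter values are illustrative placeholders, not part of the sources cited above:
<syntaxhighlight lang="python">
import numpy as np

def sgd_momentum(grad_Qi, w, data, eta=0.01, alpha=0.9, epochs=10, seed=0):
    """SGD with momentum: delta_w := alpha*delta_w - eta*grad_Qi(w, x_i); w := w + delta_w.

    grad_Qi(w, x_i) is assumed to return the gradient of the per-example loss
    Q_i at w as a NumPy array of the same shape as w (hypothetical placeholder).
    """
    rng = np.random.default_rng(seed)
    delta_w = np.zeros_like(w)                    # previous update, initially zero
    for _ in range(epochs):
        for i in rng.permutation(len(data)):      # visit training examples in random order
            # Blend the previous update with the current per-example gradient.
            # (The Nesterov variant would evaluate the gradient at w + alpha*delta_w instead of w.)
            delta_w = alpha * delta_w - eta * grad_Qi(w, data[i])
            w = w + delta_w                       # apply the combined step
    return w
</syntaxhighlight>
With <math>\alpha = 0</math> this reduces to plain stochastic gradient descent.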