===Momentum===
{{Anchor|Momentum|Nesterov}}
Further proposals include the ''momentum method'' or the ''heavy ball method'', which in the machine-learning context appeared in [[David Rumelhart|Rumelhart]], [[Geoffrey Hinton|Hinton]] and [[Ronald J. Williams|Williams]]' paper on backpropagation learning<ref name="Rumelhart1986">{{cite journal|last=Rumelhart|first=David E.|author2=Hinton, Geoffrey E.|author3=Williams, Ronald J.|title=Learning representations by back-propagating errors|journal=Nature|date=8 October 1986|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0|bibcode=1986Natur.323..533R|s2cid=205001834}}</ref> and borrowed the idea from Soviet mathematician Boris Polyak's 1964 article on solving functional equations.<ref>{{cite web | url=https://boostedml.com/2020/07/gradient-descent-and-momentum-the-heavy-ball-method.html | title=Gradient Descent and Momentum: The Heavy Ball Method | date=13 July 2020 }}</ref> Stochastic gradient descent with momentum remembers the update {{math|Δ''w''}} at each iteration, and determines the next update as a [[linear combination]] of the gradient and the previous update:<ref name="Sutskever2013">{{cite conference|last=Sutskever|first=Ilya|author2=Martens, James|author3=Dahl, George|author4=Hinton, Geoffrey E.|editor=Sanjoy Dasgupta and David Mcallester|title=On the importance of initialization and momentum in deep learning|conference=In Proceedings of the 30th international conference on machine learning (ICML-13)|date=June 2013|volume=28|location=Atlanta, GA|pages=1139–1147|url=http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf|access-date=14 January 2016}}</ref><ref name="SutskeverPhD">{{cite thesis|last=Sutskever|first=Ilya|title=Training recurrent neural networks|date=2013|publisher=University of Toronto|url=http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf|type=Ph.D.|page=74}}</ref>
<math display="block">\Delta w := \alpha \Delta w - \eta\, \nabla Q_i(w)</math>
<math display="block">w := w + \Delta w </math>
which leads to:
<math display="block">w := w - \eta\, \nabla Q_i(w) + \alpha \Delta w </math>
where the [[parametric statistics|parameter]] <math>w</math> which minimizes <math>Q(w)</math> is to be [[estimator|estimated]], <math>\eta</math> is a step size (sometimes called the ''[[learning rate]]'' in machine learning) and <math>\alpha</math> is an exponential [[Learning rate#Learning rate schedule|decay factor]] between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.
The name momentum stems from an analogy to [[momentum]] in physics: the weight vector <math>w</math>, thought of as a particle traveling through parameter space,{{r|Rumelhart1986}} incurs acceleration from the gradient of the loss ("[[force]]"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, damping oscillations. Momentum has been used successfully by computer scientists in the training of [[artificial neural networks]] for several decades.<ref name="Zeiler 2012">{{cite arXiv |last=Zeiler |first=Matthew D. |eprint=1212.5701 |title=ADADELTA: An adaptive learning rate method |year=2012|class=cs.LG }}</ref> The ''momentum method'' is closely related to [[Langevin dynamics|underdamped Langevin dynamics]], and may be combined with [[simulated annealing]].<ref name="Borysenko2021">{{cite journal|last=Borysenko|first=Oleksandr|author2=Byshkin, Maksym|title=CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing|journal=Scientific Reports|date=2021|volume=11|issue=1|pages=10705|doi=10.1038/s41598-021-90144-3|pmid=34021212|pmc=8139967|arxiv=2005.14605|bibcode=2021NatSR..1110705B}}</ref> In the mid-1980s the method was modified by [[Yurii Nesterov]] to use the gradient predicted at the next point, and the resulting method, known as ''Nesterov Accelerated Gradient'', saw some use in machine learning in the 2010s.<ref>{{cite web | url=https://paperswithcode.com/method/nesterov-accelerated-gradient | title=Papers with Code - Nesterov Accelerated Gradient Explained }}</ref>
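The update rule above can be sketched in a few lines of Python. In this sketch the per-example gradient function <code>grad_Qi</code>, the dataset <code>data</code>, and the hyperparameter values are illustrative placeholders, not part of the sources cited above:
<syntaxhighlight lang="python">
import numpy as np

def sgd_momentum(grad_Qi, w, data, eta=0.01, alpha=0.9, epochs=10, seed=0):
    """SGD with momentum: delta_w := alpha*delta_w - eta*grad_Qi(w, x_i); w := w + delta_w.

    grad_Qi(w, x_i) is assumed to return the gradient of the per-example loss
    Q_i at w as a NumPy array of the same shape as w (hypothetical placeholder).
    """
    rng = np.random.default_rng(seed)
    delta_w = np.zeros_like(w)                    # previous update, initially zero
    for _ in range(epochs):
        for i in rng.permutation(len(data)):      # visit training examples in random order
            # Blend the previous update with the current per-example gradient.
            # (The Nesterov variant would evaluate the gradient at w + alpha*delta_w instead of w.)
            delta_w = alpha * delta_w - eta * grad_Qi(w, data[i])
            w = w + delta_w                       # apply the combined step
    return w
</syntaxhighlight>
With <math>\alpha = 0</math> this reduces to plain stochastic gradient descent.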