===Momentum or ''heavy ball'' method===
To dampen the zig-zag pattern of gradient descent, the ''momentum'' or ''heavy ball'' method adds a momentum term, by analogy with a heavy ball sliding on the surface of values of the function being minimized,<ref name="BP" /> or with the motion of a mass in [[Newtonian dynamics]] through a [[viscous]] medium in a [[conservative force]] field.<ref>{{cite journal|last1=Qian |first1=Ning |title=On the momentum term in gradient descent learning algorithms |journal=[[Neural Networks (journal)|Neural Networks]] |date=January 1999 |volume=12 |issue=1 |pages=145–151 |doi=10.1016/S0893-6080(98)00116-6 |pmid=12662723 |citeseerx=10.1.1.57.5612 |s2cid=2783597 }}</ref> Gradient descent with momentum remembers the solution update at each iteration and determines the next update as a [[linear combination]] of the gradient and the previous update. For unconstrained quadratic minimization, a theoretical convergence rate bound of the heavy ball method is asymptotically the same as that of the optimal [[conjugate gradient method]].<ref name="BP" />
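
The update described above can be illustrated with a minimal sketch. The function name <code>heavy_ball</code>, the parameter names <code>lr</code> (step size) and <code>beta</code> (momentum coefficient), and the test quadratic are assumptions chosen for this example and do not come from the cited sources. Each new update is formed as <code>beta</code> times the previous update minus <code>lr</code> times the current gradient; setting <code>beta = 0</code> recovers plain gradient descent.

```python
import numpy as np

def heavy_ball(grad, x0, lr, beta, n_steps):
    """Gradient descent with momentum (illustrative sketch, not a reference implementation)."""
    x = np.asarray(x0, dtype=float)
    update = np.zeros_like(x)  # the remembered previous update, initially zero
    for _ in range(n_steps):
        # Next update: linear combination of the previous update and the gradient.
        update = beta * update - lr * grad(x)
        x = x + update
    return x

# Assumed test problem: the ill-conditioned quadratic f(x) = 0.5 * x^T A x,
# whose minimizer is the origin.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])

x_plain = heavy_ball(grad, x0, lr=0.01, beta=0.0, n_steps=200)     # beta = 0: plain gradient descent
x_momentum = heavy_ball(grad, x0, lr=0.01, beta=0.9, n_steps=200)  # with momentum

print(np.linalg.norm(x_plain), np.linalg.norm(x_momentum))
```

On this particular quadratic, with these hand-picked values of <code>lr</code> and <code>beta</code>, the momentum iterate ends considerably closer to the minimizer than the plain iterate after the same number of steps. Other problems or parameter choices may behave differently, and too large a momentum coefficient or step size can cause divergence.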

This technique is used in [[Stochastic gradient descent#Momentum|stochastic gradient descent]] and as an extension to the [[backpropagation]] algorithms used to train [[artificial neural network]]s.<ref>{{cite web|title=Momentum and Learning Rate Adaptation|url=http://www.willamette.edu/~gorr/classes/cs449/momrate.html|publisher=[[Willamette University]]|access-date=17 October 2014}}</ref><ref>{{cite web|author1=Geoffrey Hinton|author-link=Geoffrey Hinton|author2=Nitish Srivastava|author3=Kevin Swersky|title=The momentum method|url=https://www.coursera.org/lecture/neural-networks/the-momentum-method-Oya9a|website=[[Coursera]]|access-date=2 October 2018}} Part of a lecture series for the [[Coursera]] online course [https://www.coursera.org/learn/neural-networks Neural Networks for Machine Learning] {{Webarchive|url=https://web.archive.org/web/20161231174321/https://www.coursera.org/learn/neural-networks |date=2016-12-31 }}.</ref> In stochastic gradient descent the update direction is itself random, because each gradient is estimated from a randomly selected sample or mini-batch of the training data. In both settings the derivatives are evaluated at the current weights.