
==Modifications==
Gradient descent can converge to a local minimum and slow down in a neighborhood of a [[saddle point]]. Even for unconstrained quadratic minimization, gradient descent develops a zig-zag pattern of subsequent iterates as iterations progress, resulting in slow convergence. Multiple modifications of gradient descent have been proposed to address these deficiencies.

===Fast gradient methods===
[[Yurii Nesterov]] proposed<ref>{{cite book |first=Yurii |last=Nesterov |author-link=Yurii Nesterov |title=Introductory Lectures on Convex Optimization: A Basic Course |publisher=Springer |year=2004 |isbn=1-4020-7553-7 }}</ref> a simple modification that enables faster convergence for convex problems and has since been further generalized. For unconstrained smooth problems, the method is called the [[fast gradient method]] (FGM) or the [[accelerated gradient method]] (AGM). Specifically, if the differentiable function <math>F</math> is convex and <math>\nabla F</math> is [[Lipschitz continuity|Lipschitz]], and it is not assumed that <math>F</math> is [[Convex function#Strongly convex functions|strongly convex]], then the error in the objective value at step <math>k</math> of the gradient descent method is [[Big O notation|bounded by]] <math display="inline">\mathcal{O}\left({k^{-1}}\right)</math>. With the Nesterov acceleration technique, the error decreases at the rate <math display="inline">\mathcal{O}\left({k^{-2}}\right)</math>.<ref>{{cite web |url=https://www.seas.ucla.edu/~vandenbe/236C/lectures/fgrad.pdf |title=Fast Gradient Methods |work=Lecture notes for EE236C at UCLA |first=Lieven |last=Vandenberghe |date=2019 }}</ref><ref>{{Cite journal |last=Walkington |first=Noel J. |date=2023 |title=Nesterov's Method for Convex Optimization |url=https://epubs.siam.org/doi/10.1137/21M1390037 |journal=SIAM Review |language=en |volume=65 |issue=2 |pages=539–562 |doi=10.1137/21M1390037 |issn=0036-1445}}</ref> The rate <math>\mathcal{O}\left({k^{-2}}\right)</math> for the decrease of the [[loss function|cost function]] is known to be optimal for first-order optimization methods. Nevertheless, the algorithm can still be improved by reducing the constant factor. The [[optimized gradient method]] (OGM)<ref>{{cite journal |first1=D. |last1=Kim |first2=J. A. |last2=Fessler |title=Optimized First-order Methods for Smooth Convex Minimization |journal=[[Mathematical Programming]] |volume=151 |issue=1–2 |pages=81–107 |year=2016 |doi=10.1007/s10107-015-0949-3 |pmid=27765996 |pmc=5067109 |arxiv=1406.5468 |s2cid=207055414 }}</ref> reduces that constant by a factor of two and is an optimal first-order method for large-scale problems.<ref>{{cite journal |first=Yoel |last=Drori |date=2017 |title=The Exact Information-based Complexity of Smooth Convex Minimization |journal=Journal of Complexity |volume=39 |pages=1–16 |doi=10.1016/j.jco.2016.11.001 |arxiv=1606.01424 |s2cid=205861966 }}</ref>

For constrained or non-smooth problems, Nesterov's FGM is called the [[fast proximal gradient method]] (FPGM), an acceleration of the [[proximal gradient method]].
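
For the unconstrained smooth case, the difference between plain gradient descent and an accelerated method can be illustrated with a minimal sketch. It assumes one common form of Nesterov's scheme (a gradient step of size <math>1/L</math> taken from an extrapolated point, with the extrapolation weight driven by an auxiliary sequence <code>t</code>); the test function, starting point, iteration count and all function names are illustrative choices, not taken from the cited sources.

```python
import math

def gradient_descent(grad, x0, L, num_iters):
    """Plain gradient descent with the fixed step size 1/L."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x

def accelerated_gradient(grad, x0, L, num_iters):
    """One common form of Nesterov's accelerated gradient method (assumed variant)."""
    x = list(x0)   # main iterate
    y = list(x0)   # extrapolated point where the gradient is evaluated
    t = 1.0        # auxiliary sequence controlling the extrapolation weight
    for _ in range(num_iters):
        g = grad(y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        weight = (t - 1.0) / t_next
        # Extrapolate along the direction of the most recent change in x.
        y = [xn + weight * (xn - xo) for xn, xo in zip(x_next, x)]
        x, t = x_next, t_next
    return x

# Illustrative ill-conditioned quadratic F(x) = 0.5 * sum(a_i * x_i**2), minimum value 0.
eigs = [1e-3, 1e-2, 1e-1, 1.0]

def F(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))

def grad_F(x):
    return [a * xi for a, xi in zip(eigs, x)]

L = max(eigs)              # Lipschitz constant of the gradient
x0 = [1.0, 1.0, 1.0, 1.0]
k = 200

err_gd = F(gradient_descent(grad_F, x0, L, k))
err_agm = F(accelerated_gradient(grad_F, x0, L, k))
print("gradient descent error:", err_gd)
print("accelerated error:     ", err_agm)
```

On this example the accelerated iterate should reach a noticeably smaller objective value than plain gradient descent after the same number of gradient evaluations, in line with the <math display="inline">\mathcal{O}\left({k^{-2}}\right)</math> versus <math display="inline">\mathcal{O}\left({k^{-1}}\right)</math> bounds; the exact numbers depend on the chosen problem.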

===Momentum or ''heavy ball'' method===
Trying to break the zig-zag pattern of gradient descent, the ''momentum or heavy ball method'' uses a momentum term in analogy to a heavy ball sliding on the surface of values of the function being minimized,<ref name="BP" /> or to mass movement in [[Newtonian dynamics]] through a [[viscous]] medium in a [[conservative force]] field.<ref>{{cite journal|last1=Qian |first1=Ning |title=On the momentum term in gradient descent learning algorithms |journal=[[Neural Networks (journal)|Neural Networks]] |date=January 1999 |volume=12 |issue=1 |pages=145–151 |doi=10.1016/S0893-6080(98)00116-6 |pmid=12662723 |citeseerx=10.1.1.57.5612 |s2cid=2783597 }}</ref> Gradient descent with momentum remembers the solution update at each iteration, and determines the next update as a [[linear combination]] of the gradient and the previous update. For unconstrained quadratic minimization, a theoretical convergence rate bound of the heavy ball method is asymptotically the same as that for the optimal [[conjugate gradient method]].<ref name="BP" />
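
The update described above, in which each step is a linear combination of the current gradient and the previous update, can be sketched as follows. The step size <code>alpha</code> and momentum weight <code>beta</code> are set with the classical tuning for a quadratic whose eigenvalues lie between <math>\mu</math> and <math>L</math>; that tuning, the test problem and the function names are assumptions made for illustration.

```python
import math

def heavy_ball(grad, x0, alpha, beta, num_iters):
    """Gradient descent with momentum; beta = 0 gives plain gradient descent."""
    x = list(x0)
    update = [0.0] * len(x)   # remembered update from the previous iteration
    for _ in range(num_iters):
        g = grad(x)
        # New update: linear combination of the gradient and the previous update.
        update = [-alpha * gi + beta * ui for gi, ui in zip(g, update)]
        x = [xi + ui for xi, ui in zip(x, update)]
    return x

# Illustrative quadratic F(x) = 0.5 * sum(a_i * x_i**2) with condition number 100.
eigs = [0.01, 1.0]

def F(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))

def grad_F(x):
    return [a * xi for a, xi in zip(eigs, x)]

mu, L = min(eigs), max(eigs)
# Assumed parameter choice: classical heavy ball tuning for quadratics.
alpha = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
beta = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2

x0 = [1.0, 1.0]
f_gd = F(heavy_ball(grad_F, x0, 1.0 / L, 0.0, 100))   # no momentum, step 1/L
f_hb = F(heavy_ball(grad_F, x0, alpha, beta, 100))    # with momentum
print("without momentum:", f_gd)
print("with momentum:   ", f_hb)
```

Note that the momentum step size here is larger than plain gradient descent could tolerate on this problem; the momentum term is what keeps the iteration stable. This tuning relies on knowing <math>\mu</math> and <math>L</math>, which is rarely the case in practice.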

This technique is used in [[Stochastic gradient descent#Momentum|stochastic gradient descent]] and as an extension to the [[backpropagation]] algorithms used to train [[artificial neural network]]s.<ref>{{cite web|title=Momentum and Learning Rate Adaptation|url=http://www.willamette.edu/~gorr/classes/cs449/momrate.html|publisher=[[Willamette University]]|access-date=17 October 2014}}</ref><ref>{{cite web|author1=Geoffrey Hinton|author-link=Geoffrey Hinton|author2=Nitish Srivastava|author3=Kevin Swersky|title=The momentum method|url=https://www.coursera.org/lecture/neural-networks/the-momentum-method-Oya9a|website=[[Coursera]]|access-date=2 October 2018}} Part of a lecture series for the [[Coursera]] online course [https://www.coursera.org/learn/neural-networks Neural Networks for Machine Learning] {{Webarchive|url=https://web.archive.org/web/20161231174321/https://www.coursera.org/learn/neural-networks |date=2016-12-31 }}.</ref> In stochastic gradient descent, the update direction is itself random, because each step uses a gradient estimate computed from a randomly chosen part of the data rather than the full gradient. In either case, the derivatives are evaluated at the current values of the weights.