{{Short description|Optimization algorithm}}
{{About||the analytical method called "steepest descent"|Method of steepest descent}}
{{Machine learning}}
[[File:Gradient Descent in 2D.webm|thumb|right|Gradient descent in 2D]]

'''Gradient descent''' is a method for unconstrained [[mathematical optimization]]. It is a [[:Category:First order methods|first-order]] [[Iterative algorithm|iterative]] [[algorithm]] for minimizing a [[differentiable function|differentiable]] [[multivariate function]]. The idea is to take repeated steps in the direction opposite to the [[gradient]] (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient leads to a trajectory that maximizes the function; that procedure is known as ''gradient ascent''. Gradient descent is particularly useful in machine learning for minimizing the cost or loss function.<ref name="auto">{{Cite book |last1=Boyd |first1=Stephen |url=http://dx.doi.org/10.1017/cbo9780511804441 |title=Convex Optimization |last2=Vandenberghe |first2=Lieven |date=2004-03-08 |publisher=Cambridge University Press |doi=10.1017/cbo9780511804441 |isbn=978-0-521-83378-3}}</ref> It should not be confused with [[Local search (optimization)|local search]] algorithms, although both are [[Iterative method|iterative methods]] for [[Global optimization|optimization]].

Gradient descent is generally attributed to [[Augustin-Louis Cauchy]], who first suggested it in 1847.<ref>{{cite book|vauthors=((Lemaréchal, C.)) | title=Optimization Stories | veditors=((Grötschel, M.)) | date=1 January 2012 | chapter=Cauchy and the gradient method | series=Documenta Mathematica Series | publisher=EMS Press | volume=6 | edition=1st | pages=251–254 | chapter-url=https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf| doi=10.4171/dms/6/27 | doi-access=free | isbn=978-3-936609-58-5|access-date=2020-01-26 |archive-date=2018-12-29 |archive-url=https://web.archive.org/web/20181229073335/https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf |url-status=dead}}</ref> [[Jacques Hadamard]] independently proposed a similar method in 1907.<ref>{{Cite journal|last=Hadamard|first=Jacques|date=1908|title=Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées|journal=Mémoires présentés par divers savants étrangers à l'Académie des Sciences de l'Institut de France|volume=33}}</ref><ref>{{cite journal |last1=Courant |first1=R. |title=Variational methods for the solution of problems of equilibrium and vibrations |journal=Bulletin of the American Mathematical Society |date=1943 |volume=49 |issue=1 |pages=1–23 |doi=10.1090/S0002-9904-1943-07818-4 |doi-access=free }}</ref> Its convergence properties for non-linear optimization problems were first studied by [[Haskell Curry]] in 1944,<ref>{{cite journal |first=Haskell B. |last=Curry |title=The Method of Steepest Descent for Non-linear Minimization Problems |journal=Quart. Appl. Math. |volume=2 |year=1944 |issue=3 |pages=258–261 |doi=10.1090/qam/10667 |doi-access=free }}</ref> and the method became increasingly well studied and widely used in the following decades.<ref name="BP" /><ref name="AK82" /> A simple extension of gradient descent, [[stochastic gradient descent]], serves as the most basic algorithm used for training most [[deep neural network|deep network]]s today.
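The repeated step against the gradient can be sketched in a few lines of Python. The quadratic objective <math>F(x, y) = x^2 + 2y^2</math>, the starting point, and the fixed step size used below are illustrative assumptions chosen for this sketch, not details taken from the sources cited above.

<syntaxhighlight lang="python">
# Minimal sketch of gradient descent with a fixed step size:
# x_{k+1} = x_k - gamma * grad F(x_k).
# The objective, starting point, and step size are illustrative assumptions.
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, num_steps=100):
    """Repeatedly step in the direction opposite to the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad(x)  # move along the direction of steepest descent
    return x

# Hypothetical example: F(x, y) = x**2 + 2*y**2, with gradient (2x, 4y).
grad_F = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
minimum = gradient_descent(grad_F, x0=[3.0, -2.0])
print(minimum)  # approaches the true minimizer (0, 0)
</syntaxhighlight>

With these choices each iterate contracts toward the minimizer <math>(0, 0)</math>; in practice the step size is usually tuned or chosen by a line search rather than fixed in advance.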