==Comments==
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case, the search space is typically a [[function space]], and one calculates the [[Fréchet derivative]] of the functional to be minimized to determine the descent direction.<ref name="AK82">{{cite book |first1=G. P. |last1=Akilov |first2=L. V. |last2=Kantorovich |author-link2=Leonid Kantorovich |title=Functional Analysis |publisher=Pergamon Press |edition=2nd |isbn=0-08-023036-9 |year=1982 }}</ref> That gradient descent works in any finite number of dimensions can be seen as a consequence of the [[Cauchy–Schwarz inequality]]: the magnitude of the inner (dot) product of two vectors of any dimension is maximized when they are [[collinear]]. In the case of gradient descent, that occurs when the vector of adjustments to the independent variables is proportional to the gradient vector of partial derivatives.

Gradient descent can take many iterations to compute a local minimum to a required [[accuracy]] if the [[curvature]] of the given function differs greatly in different directions. For such functions, [[preconditioning]], which changes the geometry of the space so as to shape the function's level sets like [[concentric circles]], cures the slow convergence. Constructing and applying preconditioning can be computationally expensive, however.

Gradient descent can be modified with momentum terms<ref>{{Cite journal |last1=Abdulkadirov |first1=Ruslan |last2=Lyakhov |first2=Pavel |last3=Nagornov |first3=Nikolay |date=January 2023 |title=Survey of Optimization Algorithms in Modern Neural Networks |journal=Mathematics |language=en |volume=11 |issue=11 |pages=2466 |doi=10.3390/math11112466 |doi-access=free |issn=2227-7390}}</ref> ([[Nesterov]], Polyak,<ref>{{Cite journal |last1=Diakonikolas |first1=Jelena |last2=Jordan |first2=Michael I. |date=January 2021 |title=Generalized Momentum-Based Methods: A Hamiltonian Perspective |url=https://epubs.siam.org/doi/10.1137/20M1322716 |journal=SIAM Journal on Optimization |language=en |volume=31 |issue=1 |pages=915–944 |doi=10.1137/20M1322716 |arxiv=1906.00436 |issn=1052-6234}}</ref> and Frank–Wolfe<ref>{{Cite journal |last=Meyer |first=Gerard G. L. |date=November 1974 |title=Accelerated Frank–Wolfe Algorithms |url=http://epubs.siam.org/doi/10.1137/0312050 |journal=SIAM Journal on Control |language=en |volume=12 |issue=4 |pages=655–663 |doi=10.1137/0312050 |issn=0036-1402}}</ref>) and heavy-ball parameters (exponential moving averages<ref>{{Citation |last1=Kingma |first1=Diederik P. |title=Adam: A Method for Stochastic Optimization |date=2017-01-29 |last2=Ba |first2=Jimmy |arxiv=1412.6980 }}</ref> and positive–negative momentum<ref>{{Cite journal |last1=Xie |first1=Zeke |last2=Yuan |first2=Li |last3=Zhu |first3=Zhanxing |last4=Sugiyama |first4=Masashi |date=2021-07-01 |title=Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization |url=https://proceedings.mlr.press/v139/xie21h.html |journal=Proceedings of the 38th International Conference on Machine Learning |language=en |publisher=PMLR |pages=11448–11458 |arxiv=2103.17182 }}</ref>). The main examples of such optimizers are Adam, DiffGrad, Yogi, and AdaBelief.

Methods based on [[Newton's method in optimization|Newton's method]] and inversion of the [[Hessian matrix|Hessian]] using [[conjugate gradient]] techniques can be better alternatives.<ref>{{cite book |first1=W. H. |last1=Press |author-link1=William H. Press |first2=S. A. |last2=Teukolsky |author-link2=Saul Teukolsky |first3=W. T. |last3=Vetterling |first4=B. P. |last4=Flannery |author-link4=Brian P. Flannery |title=Numerical Recipes in C: The Art of Scientific Computing |url=https://archive.org/details/numericalrecipes00pres_0 |url-access=registration |edition=2nd |publisher=[[Cambridge University Press]] |location=New York |year=1992 |isbn=0-521-43108-5 }}</ref><ref>{{cite book |first=T. |last=Strutz |title=Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond |edition=2nd |publisher=Springer Vieweg |year=2016 |isbn=978-3-658-11455-8 }}</ref> Generally, such methods converge in fewer iterations, but the cost of each iteration is higher. An example is the [[Broyden–Fletcher–Goldfarb–Shanno algorithm|BFGS method]], which consists of calculating at every step a matrix by which the gradient vector is multiplied to move in a "better" direction, combined with a more sophisticated [[line search]] algorithm to find the "best" value of <math>\gamma.</math> For extremely large problems, where computer-memory issues dominate, a limited-memory method such as [[Limited-memory BFGS|L-BFGS]] should be used instead of BFGS or steepest descent.

While it is sometimes possible to substitute gradient descent for a [[Local search (optimization)|local search]] algorithm, gradient descent is not in the same family: although it is an [[iterative method]] for [[Global optimization|local optimization]], it relies on an [[loss function|objective function's gradient]] rather than an explicit exploration of a [[Feasible region|solution space]].

Gradient descent can be viewed as applying [[Euler's method]] for solving the [[ordinary differential equation]] <math>x'(t)=-\nabla f(x(t))</math> of a [[gradient flow]]. In turn, this equation may be derived as an optimal controller<ref>{{cite journal |last1=Ross |first1=I.M. |title=An optimal control theory for nonlinear optimization |journal=Journal of Computational and Applied Mathematics |date=July 2019 |volume=354 |pages=39–51 |doi=10.1016/j.cam.2018.12.044 |s2cid=127649426 |doi-access=free }}</ref> for the control system <math>x'(t) = u(t)</math> with <math>u(t)</math> given in feedback form <math>u(t) = -\nabla f(x(t))</math>.
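The steepest-descent direction discussed above can be illustrated with a minimal sketch (not drawn from the cited sources; the quadratic objective and the step size are assumed values chosen for illustration):

<syntaxhighlight lang="python">
import numpy as np

# Quadratic objective f(x) = 0.5 * x^T A x with curvatures 1 and 10
# along the two axes (assumed example).
A = np.diag([1.0, 10.0])

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
gamma = 0.05  # fixed step size; must stay below 2/10 here for convergence
for _ in range(200):
    # By Cauchy–Schwarz, -grad(x) is the direction of steepest descent.
    x = x - gamma * grad(x)
print(x)  # approaches the minimizer at the origin
</syntaxhighlight>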
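The effect of preconditioning can be sketched on the same assumed quadratic; the diagonal preconditioner below is a stand-in for illustration, not a prescription from the cited sources:

<syntaxhighlight lang="python">
import numpy as np

# Multiplying the gradient by P, an approximation of the inverse Hessian,
# reshapes the elongated elliptical level sets toward concentric circles.
A = np.diag([1.0, 10.0])
P = np.diag(1.0 / np.diag(A))  # assumed diagonal preconditioner (exact here)

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
for _ in range(5):
    x = x - P @ grad(x)  # preconditioned step; a unit step size suffices
# Because P is the exact inverse Hessian of this quadratic, a single step
# reaches the minimizer; in practice P only approximates the curvature.
print(x)
</syntaxhighlight>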
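A minimal sketch of the heavy-ball (Polyak) momentum update mentioned above, with assumed coefficient values; Adam-style optimizers additionally keep an exponential moving average of squared gradients to scale each coordinate's step:

<syntaxhighlight lang="python">
import numpy as np

# Heavy-ball momentum on the same assumed quadratic.  The velocity v is an
# exponentially weighted accumulation of past gradients, which damps
# zig-zagging across the steep direction.
A = np.diag([1.0, 10.0])

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
gamma, beta = 0.05, 0.9  # assumed step size and momentum coefficient
for _ in range(200):
    v = beta * v + grad(x)  # accumulated gradient history
    x = x - gamma * v
print(x)  # approaches the minimizer at the origin
</syntaxhighlight>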
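As a sketch of the limited-memory approach, SciPy's <code>minimize</code> exposes an L-BFGS-B implementation; it is used here on the Rosenbrock test function, with an arbitrary starting point:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS keeps only a short history of recent gradient differences to
# build a low-memory approximation of the inverse Hessian, instead of
# storing and inverting the full matrix.
x0 = np.array([-1.2, 1.0])  # arbitrary starting point
result = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')
print(result.x)  # close to the true minimizer (1, 1)
</syntaxhighlight>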
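The gradient-flow view can be made concrete: one explicit Euler step of size <math>h</math> on <math>x'(t)=-\nabla f(x(t))</math> is exactly a gradient-descent step with <math>\gamma = h</math>. A minimal sketch, reusing the same assumed quadratic:

<syntaxhighlight lang="python">
import numpy as np

# Explicit Euler integration of the gradient-flow ODE x'(t) = -grad f(x(t)).
A = np.diag([1.0, 10.0])

def grad(x):
    return A @ x

h = 0.05  # Euler step size, playing the role of gamma
x = np.array([1.0, 1.0])
for _ in range(200):
    x = x + h * (-grad(x))  # one Euler step == one gradient-descent step
print(x)  # tracks the continuous gradient flow toward the origin
</syntaxhighlight>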