=== Choosing the step size and descent direction ===
Since using a step size <math>\gamma</math> that is too small would slow convergence, and a <math>\gamma</math> too large would lead to overshoot and divergence, finding a good setting of <math>\gamma</math> is an important practical problem. [[Philip Wolfe (mathematician)|Philip Wolfe]] also advocated using "clever choices of the [descent] direction" in practice.<ref>{{cite journal |last1=Wolfe |first1=Philip |title=Convergence Conditions for Ascent Methods |journal=SIAM Review |date=April 1969 |volume=11 |issue=2 |pages=226–235 |doi=10.1137/1011036 }}</ref> While using a direction that deviates from the steepest-descent direction may seem counter-intuitive, the idea is that the smaller slope may be compensated for by being sustained over a much longer distance.

To reason about this mathematically, consider a direction <math>\mathbf{p}_n</math> and step size <math>\gamma_n</math>, and consider the more general update:
:<math>\mathbf{a}_{n+1} = \mathbf{a}_n-\gamma_n\,\mathbf{p}_n.</math>
Finding good settings of <math>\mathbf{p}_n</math> and <math>\gamma_n</math> requires some thought. First of all, we would like the update direction to point downhill. Mathematically, letting <math>\theta_n</math> denote the angle between <math>\nabla F(\mathbf{a}_n)</math> and <math>\mathbf{p}_n</math>, this requires that <math>\cos \theta_n > 0.</math> To say more, we need more information about the objective function that we are optimising.
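The general update and the downhill condition above can be sketched as follows (an illustrative sketch, not from the article; the quadratic objective in the example is a hypothetical choice):

```python
import numpy as np

def descent_step(grad_F, a, p, gamma):
    """One step of the general update a_{n+1} = a_n - gamma * p_n,
    first checking the condition cos(theta_n) > 0, i.e. that p_n has
    positive inner product with the gradient at a_n."""
    g = grad_F(a)
    cos_theta = np.dot(g, p) / (np.linalg.norm(g) * np.linalg.norm(p))
    if cos_theta <= 0:
        raise ValueError("p is not a descent direction at a")
    return a - gamma * p

# Example: F(a) = ||a||^2 / 2, whose gradient is a itself.
grad_F = lambda a: a
a = np.array([1.0, 2.0])
a_next = descent_step(grad_F, a, p=grad_F(a), gamma=0.5)  # steepest descent
```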
Under the fairly weak assumption that <math>F</math> is continuously differentiable, we may prove that:<ref>{{cite arXiv|last1=Bernstein|first1=Jeremy|last2=Vahdat|first2=Arash|last3=Yue|first3=Yisong|last4=Liu|first4=Ming-Yu|date=2020-06-12|title=On the distance between two neural networks and the stability of learning|class=cs.LG|eprint=2002.03432}}</ref>
{{NumBlk|:|<math> F(\mathbf{a}_{n+1}) \leq F(\mathbf{a}_n) - \gamma_n \|\nabla F(\mathbf{a}_n)\|_2 \|\mathbf{p}_n\|_2 \left[\cos \theta_n - \max_{t\in[0,1]} \frac{\|\nabla F(\mathbf{a}_n - t \gamma_n \mathbf{p}_n) - \nabla F(\mathbf{a}_n)\|_2}{\| \nabla F(\mathbf{a}_n) \|_2}\right]</math>|{{EquationRef|1}}}}
This inequality implies that the amount by which we can be sure the function <math>F</math> is decreased depends on a trade-off between the two terms in square brackets. The first term in square brackets measures the angle between the descent direction and the negative gradient. The second term measures how quickly the gradient changes along the descent direction. In principle inequality ({{EquationNote|1}}) could be optimized over <math>\mathbf{p}_n</math> and <math>\gamma_n</math> to choose an optimal step size and direction. The problem is that evaluating the second term in square brackets requires evaluating <math>\nabla F(\mathbf{a}_n - t \gamma_n \mathbf{p}_n)</math>, and extra gradient evaluations are generally expensive and undesirable. Some ways around this problem are:
* Forgo the benefits of a clever descent direction by setting <math>\mathbf{p}_n = \nabla F(\mathbf{a}_n)</math>, and use [[line search]] to find a suitable step size <math>\gamma_n</math>, such as one that satisfies the [[Wolfe conditions]]. A more economical way of choosing learning rates is [[backtracking line search]], a method that has both good theoretical guarantees and experimental results.
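The backtracking line search mentioned above can be sketched as follows. This is a minimal illustration using the Armijo sufficient-decrease condition; the constants <code>c</code> and <code>tau</code> are conventional illustrative defaults, not values from the article:

```python
import numpy as np

def backtracking_line_search(F, grad_F, a, p, gamma0=1.0, c=1e-4, tau=0.5):
    """Shrink the step size gamma geometrically until the Armijo condition
    F(a - gamma*p) <= F(a) - c * gamma * <grad F(a), p> holds."""
    g = grad_F(a)
    gamma = gamma0
    while F(a - gamma * p) > F(a) - c * gamma * np.dot(g, p):
        gamma *= tau
    return gamma

# Example on the quadratic F(a) = ||a||^2 / 2 with gradient a.
F = lambda a: 0.5 * np.dot(a, a)
grad_F = lambda a: a
a = np.array([3.0, 4.0])
gamma = backtracking_line_search(F, grad_F, a, p=grad_F(a))
a_next = a - gamma * grad_F(a)
```

Because each trial step only requires evaluating <math>F</math>, not <math>\nabla F</math>, this avoids the extra gradient evaluations discussed above.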
Note that one does not need to choose <math>\mathbf{p}_n</math> to be the gradient; any direction that has positive inner product with the gradient will result in a reduction of the function value (for a sufficiently small value of <math>\gamma_n</math>).
* Assuming that <math>F</math> is twice-differentiable, use its Hessian <math>\nabla^2 F</math> to estimate <math>\|\nabla F(\mathbf{a}_n - t \gamma_n \mathbf{p}_n) - \nabla F(\mathbf{a}_n)\|_2 \approx \|t \gamma_n \nabla^2 F(\mathbf{a}_n) \mathbf{p}_n\|.</math> Then choose <math>\mathbf{p}_n</math> and <math>\gamma_n</math> by optimising inequality ({{EquationNote|1}}).
* Assuming that <math>\nabla F</math> is [[Lipschitz continuity|Lipschitz]], use its Lipschitz constant <math>L</math> to bound <math>\|\nabla F(\mathbf{a}_n - t \gamma_n \mathbf{p}_n) - \nabla F(\mathbf{a}_n)\|_2 \leq L t \gamma_n \|\mathbf{p}_n\|.</math> Then choose <math>\mathbf{p}_n</math> and <math>\gamma_n</math> by optimising inequality ({{EquationNote|1}}).
* Build a custom model of <math>\max_{t\in[0,1]} \frac{\|\nabla F(\mathbf{a}_n - t \gamma_n \mathbf{p}_n) - \nabla F(\mathbf{a}_n)\|_2}{\| \nabla F(\mathbf{a}_n) \|_2}</math> for <math>F</math>. Then choose <math>\mathbf{p}_n</math> and <math>\gamma_n</math> by optimising inequality ({{EquationNote|1}}).
* Under stronger assumptions on the function <math>F</math> such as [[Convex function|convexity]], more [[#Fast gradient methods|advanced techniques]] may be possible.

Usually by following one of the recipes above, [[convergent series|convergence]] to a local minimum can be guaranteed. When the function <math>F</math> is [[Convex function|convex]], all local minima are also global minima, so in this case gradient descent can converge to the global solution.
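As an illustration of the Lipschitz-based recipe above: taking <math>\mathbf{p}_n = \nabla F(\mathbf{a}_n)</math> (so <math>\cos \theta_n = 1</math>) and plugging the bound into inequality ({{EquationNote|1}}) gives a guaranteed decrease proportional to <math>\gamma_n(1 - L\gamma_n)</math>, which is maximised at <math>\gamma_n = 1/(2L)</math>. A minimal sketch (illustrative, not from the article; it assumes an upper bound <code>L</code> on the Lipschitz constant of the gradient is known):

```python
import numpy as np

def lipschitz_step(grad_F, a, L):
    """Steepest-descent step with gamma = 1/(2L), the step size that
    maximises the decrease gamma * (1 - L * gamma) guaranteed by
    inequality (1) when p_n is the gradient."""
    gamma = 1.0 / (2.0 * L)
    return a - gamma * grad_F(a)

# Example: F(a) = ||a||^2 / 2 has gradient a, which is Lipschitz with L = 1.
grad_F = lambda a: a
a = np.array([2.0, -2.0])
a_next = lipschitz_step(grad_F, a, L=1.0)
```

No gradient evaluations beyond <math>\nabla F(\mathbf{a}_n)</math> are needed, at the cost of requiring the constant <math>L</math> in advance.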