==Solution of a linear system==
[[File:Steepest descent.png|thumb|380px|The steepest descent algorithm applied to the [[Wiener filter]]<ref>Haykin, Simon S. ''Adaptive Filter Theory''. Pearson Education India, 2008. pp. 108–142, 217–242.</ref>]]

Gradient descent can be used to solve a [[system of linear equations]]
:<math>A\mathbf{x}-\mathbf{b}=0</math>
reformulated as a quadratic minimization problem. If the system matrix <math>A</math> is real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]], the objective function is defined as the quadratic function
:<math>F(\mathbf{x})=\mathbf{x}^T A\mathbf{x}-2\mathbf{x}^T\mathbf{b},</math>
whose minimization yields the solution of the system, since
:<math>\nabla F(\mathbf{x})=2(A\mathbf{x}-\mathbf{b}).</math>

For a general real matrix <math>A</math>, [[linear least squares]] define
:<math>F(\mathbf{x})=\left\|A\mathbf{x}-\mathbf{b}\right\|^2.</math>
In traditional linear least squares for real <math>A</math> and <math>\mathbf{b}</math>, the [[Euclidean norm]] is used, in which case
:<math>\nabla F(\mathbf{x})=2A^T(A\mathbf{x}-\mathbf{b}).</math>

The [[line search]] minimization, which finds the locally optimal step size <math>\gamma</math> on every iteration, can be performed analytically for quadratic functions, and explicit formulas for the locally optimal <math>\gamma</math> are known.<ref name="BP">{{cite book |author-link=Boris T. Polyak |last=Polyak |first=Boris |title=Introduction to Optimization |year=1987 |language=en |url=https://www.researchgate.net/publication/342978480 }}</ref><ref name=saad1996iterative>{{cite book |last=Saad |first=Yousef |title=Iterative Methods for Sparse Linear Systems |year=2003 |publisher=Society for Industrial and Applied Mathematics |location=Philadelphia, Pa. |isbn=978-0-89871-534-7 |pages=[https://archive.org/details/iterativemethods0000saad/page/195 195] |edition=2nd |url=https://archive.org/details/iterativemethods0000saad/page/195}}</ref>

For example, for a real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]] matrix <math>A</math>, a simple algorithm can be as follows:<ref name="BP" />
:<math>\begin{align}
& \text{repeat in the loop:} \\
& \qquad \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \qquad \gamma := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \gamma \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>
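The loop above translates almost directly into code. The following is a minimal illustrative sketch in Python with [[NumPy]]; the function name <code>steepest_descent_simple</code>, the tolerance <code>tol</code>, the iteration cap <code>max_iter</code>, and the example system are choices made here for illustration and are not part of the cited algorithm.

<syntaxhighlight lang="python">
import numpy as np

def steepest_descent_simple(A, b, x0, tol=1e-10, max_iter=10_000):
    """Solve A x = b for symmetric positive-definite A by steepest descent.

    The residual r = b - A x is recomputed from scratch every iteration,
    which costs one extra multiplication by A compared with the
    traditional variant given below.
    """
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                # residual; proportional to the negative gradient of F
        rr = r @ r
        if rr < tol:                 # residual sufficiently small: exit loop
            break
        gamma = rr / (r @ (A @ r))   # exact line-search step for the quadratic F
        x = x + gamma * r
    return x

# Example usage with a small symmetric positive-definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent_simple(A, b, x0=np.zeros(2))
print(x, A @ x - b)                  # A @ x should be close to b
</syntaxhighlight>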
To avoid multiplying by <math>A</math> twice per iteration, we note that <math>\mathbf{x} := \mathbf{x} + \gamma \mathbf{r}</math> implies <math>\mathbf{r} := \mathbf{r} - \gamma \mathbf{A r}</math>, which gives the traditional algorithm,<ref name=":0">{{cite journal |first1=Henricus |last1=Bouwmeester |first2=Andrew |last2=Dougherty |first3=Andrew V. |last3=Knyazev |title=Nonsymmetric Preconditioning for Conjugate Gradient and Steepest Descent Methods |journal=Procedia Computer Science |volume=51 |pages=276–285 |year=2015 |doi=10.1016/j.procs.2015.05.241 |doi-access=free |arxiv=1212.6680 }}</ref>
:<math>\begin{align}
& \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \text{repeat in the loop:} \\
& \qquad \gamma := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \gamma \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \qquad \mathbf{r} := \mathbf{r} - \gamma \mathbf{A r} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>

The method is rarely used for solving linear equations, with the [[conjugate gradient method]] being one of the most popular alternatives. The number of gradient descent iterations is commonly proportional to the spectral [[condition number]] <math>\kappa(A)</math> of the system matrix <math>A</math> (the ratio of the maximum to minimum [[eigenvalues]] of {{nowrap|<math>A^TA</math>)}}, while the convergence of the [[conjugate gradient method]] is typically determined by the square root of the condition number, i.e., it is much faster. Both methods can benefit from [[Preconditioner|preconditioning]], where gradient descent may require fewer assumptions on the preconditioner.<ref name=":0" />

=== Geometric behavior and residual orthogonality ===
In steepest descent applied to solving <math>A\vec{x} = \vec{b}</math>, where <math>A</math> is symmetric positive-definite, the residual vectors <math>\vec{r}_k = \vec{b} - A\vec{x}_k</math> of consecutive iterations are orthogonal:
:<math>\vec{r}_{k+1} \cdot \vec{r}_k = 0.</math>
Because each step is taken in the locally steepest direction, successive steps alternate between directions aligned with the extreme axes of the elongated level sets. When <math>\kappa(A)</math> is large, this produces a characteristic zig-zag path. The poor conditioning of <math>A</math> is the primary cause of the slow convergence, and the orthogonality of successive residuals reinforces this alternation.

[[File:Steepest descent convergence path for A = 2 2, 2 3.png|thumb|Convergence path of the steepest descent method for <math>A = \begin{bmatrix} 2 & 2 \\ 2 & 3 \end{bmatrix}</math>]]
As shown in the image on the right, steepest descent converges slowly due to the high condition number of <math>A</math>, and the orthogonality of the residuals forces each new direction to undo the overshoot of the previous step. The result is a path that zigzags toward the solution. This inefficiency is one reason why the conjugate gradient method or preconditioning is preferred.<ref>{{Cite book |author1=Holmes, M. |title=Introduction to Scientific Computing and Data Analysis |edition=2nd |year=2023 |publisher=Springer |isbn=978-3-031-22429-4 }}</ref>
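The orthogonality of consecutive residuals can be checked numerically. The sketch below (again Python with NumPy, using the matrix from the figure and an arbitrarily chosen right-hand side <code>b</code>, both illustrative assumptions) runs the traditional algorithm above and prints the inner product <math>\vec{r}_{k+1} \cdot \vec{r}_k</math>, which vanishes up to rounding error while the residual norm decreases geometrically at a rate governed by the condition number.

<syntaxhighlight lang="python">
import numpy as np

# Traditional steepest descent: one multiplication by A per iteration,
# with the residual updated in place as r := r - gamma * A r.
A = np.array([[2.0, 2.0], [2.0, 3.0]])   # matrix from the figure above
b = np.array([1.0, 1.0])                 # illustrative right-hand side
x = np.zeros(2)

r = b - A @ x
for k in range(10):
    Ar = A @ r
    gamma = (r @ r) / (r @ Ar)           # exact line-search step
    x = x + gamma * r
    r_new = r - gamma * Ar               # updated residual, orthogonal to r
    print(f"k={k}: r_k . r_(k+1) = {r @ r_new:+.2e}, |r_(k+1)| = {np.linalg.norm(r_new):.2e}")
    r = r_new

# Consecutive residuals are orthogonal (inner products ~ 0), and the
# iterates x trace the zig-zag path shown in the figure.
</syntaxhighlight>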