==Solution of a linear system==
[[File:Steepest descent.png|thumb|380px|The steepest descent algorithm applied to the [[Wiener filter]]<ref>Haykin, Simon S. ''Adaptive Filter Theory''. Pearson Education India, 2008. pp. 108–142, 217–242.</ref>]]

Gradient descent can be used to solve a [[system of linear equations]]
:<math>A\mathbf{x}-\mathbf{b}=0</math>
reformulated as a quadratic minimization problem. If the system matrix <math>A</math> is real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]], the objective function is defined as the quadratic function
:<math>F(\mathbf{x})=\mathbf{x}^T A\mathbf{x}-2\mathbf{x}^T\mathbf{b},</math>
whose minimization yields the solution of the system, since
:<math>\nabla F(\mathbf{x})=2(A\mathbf{x}-\mathbf{b}).</math>

For a general real matrix <math>A</math>, [[linear least squares]] define
:<math>F(\mathbf{x})=\left\|A\mathbf{x}-\mathbf{b}\right\|^2.</math>
In traditional linear least squares for real <math>A</math> and <math>\mathbf{b}</math>, the [[Euclidean norm]] is used, in which case
:<math>\nabla F(\mathbf{x})=2A^T(A\mathbf{x}-\mathbf{b}).</math>

The [[line search]] minimization, which finds the locally optimal step size <math>\gamma</math> on every iteration, can be performed analytically for quadratic functions, and explicit formulas for the locally optimal <math>\gamma</math> are known.<ref name="BP">{{cite book |author-link=Boris T. Polyak |last=Polyak |first=Boris |title=Introduction to Optimization |year=1987 |language=en |url=https://www.researchgate.net/publication/342978480 }}</ref><ref name=saad1996iterative>{{cite book |last=Saad |first=Yousef |title=Iterative Methods for Sparse Linear Systems |year=2003 |publisher=Society for Industrial and Applied Mathematics |location=Philadelphia, Pa. |isbn=978-0-89871-534-7 |pages=[https://archive.org/details/iterativemethods0000saad/page/195 195] |edition=2nd |url=https://archive.org/details/iterativemethods0000saad/page/195}}</ref>

For example, for a real [[Symmetric matrix|symmetric]] and [[positive-definite matrix|positive-definite]] matrix <math>A</math>, a simple algorithm can be as follows:<ref name="BP" />
:<math>\begin{align}
& \text{repeat in the loop:} \\
& \qquad \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \qquad \gamma := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \gamma \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>
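The loop above translates almost directly into code. The following is a minimal illustrative sketch in Python with [[NumPy]]; the function name <code>steepest_descent_simple</code>, the tolerance <code>tol</code>, the iteration cap <code>max_iter</code>, and the example system are choices made here for illustration and are not part of the cited algorithm.

<syntaxhighlight lang="python">
import numpy as np

def steepest_descent_simple(A, b, x0, tol=1e-10, max_iter=10_000):
    """Solve A x = b for symmetric positive-definite A by steepest descent.

    The residual r = b - A x is recomputed from scratch every iteration,
    which costs one extra multiplication by A compared with the
    traditional variant given below.
    """
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                # residual; proportional to the negative gradient of F
        rr = r @ r
        if rr < tol:                 # residual sufficiently small: exit loop
            break
        gamma = rr / (r @ (A @ r))   # exact line-search step for the quadratic F
        x = x + gamma * r
    return x

# Example usage with a small symmetric positive-definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent_simple(A, b, x0=np.zeros(2))
print(x, A @ x - b)                  # A @ x should be close to b
</syntaxhighlight>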
To avoid multiplying by <math>A</math> twice per iteration, we note that <math>\mathbf{x} := \mathbf{x} + \gamma \mathbf{r}</math> implies <math>\mathbf{r} := \mathbf{r} - \gamma \mathbf{A r}</math>, which gives the traditional algorithm,<ref name=":0">{{cite journal |first1=Henricus |last1=Bouwmeester |first2=Andrew |last2=Dougherty |first3=Andrew V. |last3=Knyazev |title=Nonsymmetric Preconditioning for Conjugate Gradient and Steepest Descent Methods |journal=Procedia Computer Science |volume=51 |pages=276–285 |year=2015 |doi=10.1016/j.procs.2015.05.241 |doi-access=free |arxiv=1212.6680 }}</ref>
:<math>\begin{align}
& \mathbf{r} := \mathbf{b} - \mathbf{A x} \\
& \text{repeat in the loop:} \\
& \qquad \gamma := {\mathbf{r}^\mathsf{T} \mathbf{r}}/{\mathbf{r}^\mathsf{T} \mathbf{A r}} \\
& \qquad \mathbf{x} := \mathbf{x} + \gamma \mathbf{r} \\
& \qquad \hbox{if } \mathbf{r}^\mathsf{T} \mathbf{r} \text{ is sufficiently small, then exit loop} \\
& \qquad \mathbf{r} := \mathbf{r} - \gamma \mathbf{A r} \\
& \text{end repeat loop} \\
& \text{return } \mathbf{x} \text{ as the result}
\end{align}</math>

The method is rarely used for solving linear equations, with the [[conjugate gradient method]] being one of the most popular alternatives. The number of gradient descent iterations is commonly proportional to the spectral [[condition number]] <math>\kappa(A)</math> of the system matrix <math>A</math> (the ratio of the maximum to minimum [[eigenvalues]] of {{nowrap|<math>A^TA</math>)}}, while the convergence of the [[conjugate gradient method]] is typically determined by the square root of the condition number, i.e., it is much faster. Both methods can benefit from [[Preconditioner|preconditioning]], where gradient descent may require fewer assumptions on the preconditioner.<ref name=":0" />

=== Geometric behavior and residual orthogonality ===
In steepest descent applied to solving <math>A\vec{x} = \vec{b}</math>, where <math>A</math> is symmetric positive-definite, the residual vectors <math>\vec{r}_k = \vec{b} - A\vec{x}_k</math> of consecutive iterations are orthogonal:
:<math>\vec{r}_{k+1} \cdot \vec{r}_k = 0.</math>
Because each step is taken in the locally steepest direction, successive steps alternate between directions aligned with the extreme axes of the elongated level sets. When <math>\kappa(A)</math> is large, this produces a characteristic zig-zag path. The poor conditioning of <math>A</math> is the primary cause of the slow convergence, and the orthogonality of successive residuals reinforces this alternation.

[[File:Steepest descent convergence path for A = 2 2, 2 3.png|thumb|Convergence path of the steepest descent method for <math>A = \begin{bmatrix} 2 & 2 \\ 2 & 3 \end{bmatrix}</math>]]
As shown in the image on the right, steepest descent converges slowly due to the high condition number of <math>A</math>, and the orthogonality of the residuals forces each new direction to undo the overshoot of the previous step. The result is a path that zigzags toward the solution. This inefficiency is one reason why the conjugate gradient method or preconditioning is preferred.<ref>{{Cite book |author1=Holmes, M. |title=Introduction to Scientific Computing and Data Analysis |edition=2nd |year=2023 |publisher=Springer |isbn=978-3-031-22429-4 }}</ref>
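The orthogonality of consecutive residuals can be checked numerically. The sketch below (again Python with NumPy, using the matrix from the figure and an arbitrarily chosen right-hand side <code>b</code>, both illustrative assumptions) runs the traditional algorithm above and prints the inner product <math>\vec{r}_{k+1} \cdot \vec{r}_k</math>, which vanishes up to rounding error while the residual norm decreases geometrically at a rate governed by the condition number.

<syntaxhighlight lang="python">
import numpy as np

# Traditional steepest descent: one multiplication by A per iteration,
# with the residual updated in place as r := r - gamma * A r.
A = np.array([[2.0, 2.0], [2.0, 3.0]])   # matrix from the figure above
b = np.array([1.0, 1.0])                 # illustrative right-hand side
x = np.zeros(2)

r = b - A @ x
for k in range(10):
    Ar = A @ r
    gamma = (r @ r) / (r @ Ar)           # exact line-search step
    x = x + gamma * r
    r_new = r - gamma * Ar               # updated residual, orthogonal to r
    print(f"k={k}: r_k . r_(k+1) = {r @ r_new:+.2e}, |r_(k+1)| = {np.linalg.norm(r_new):.2e}")
    r = r_new

# Consecutive residuals are orthogonal (inner products ~ 0), and the
# iterates x trace the zig-zag path shown in the figure.
</syntaxhighlight>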