===The resulting algorithm===
The above algorithm gives the most straightforward explanation of the conjugate gradient method. As stated, the algorithm seemingly requires storage of all previous search directions and residual vectors, as well as many matrix–vector multiplications, and thus can be computationally expensive. However, a closer analysis of the algorithm shows that <math>\mathbf{r}_i</math> is orthogonal to <math>\mathbf{r}_j</math>, i.e. <math>\mathbf{r}_i^\mathsf{T} \mathbf{r}_j=0</math>, for <math>i \neq j</math>, and that <math>\mathbf{p}_i</math> is <math>\mathbf{A}</math>-orthogonal to <math>\mathbf{p}_j</math>, i.e. <math>\mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{p}_j=0</math>, for <math>i \neq j</math>. This can be interpreted as follows: as the algorithm progresses, the <math>\mathbf{p}_i</math> and <math>\mathbf{r}_i</math> span the same [[Krylov subspace]], where the <math>\mathbf{r}_i</math> form an orthogonal basis with respect to the standard inner product and the <math>\mathbf{p}_i</math> form an orthogonal basis with respect to the inner product induced by <math>\mathbf{A}</math>. Therefore, <math>\mathbf{x}_k</math> can be regarded as the projection of <math>\mathbf{x}</math> onto the Krylov subspace. That is, if the CG method starts with <math>\mathbf{x}_0 = 0</math>, then<ref>{{Cite journal |last1=Paquette |first1=Elliot |last2=Trogdon |first2=Thomas |date=March 2023 |title=Universality for the Conjugate Gradient and MINRES Algorithms on Sample Covariance Matrices |url=https://onlinelibrary.wiley.com/doi/10.1002/cpa.22081 |journal=Communications on Pure and Applied Mathematics |language=en |volume=76 |issue=5 |pages=1085–1136 |doi=10.1002/cpa.22081 |issn=0010-3640 |arxiv=2007.00640 }}</ref><math display="block">x_k = \mathrm{argmin}_{y \in \mathbb{R}^n} {\left\{(x-y)^{\top} A(x-y): y \in \operatorname{span}\left\{b, A b, \ldots, A^{k-1} b\right\}\right\}}</math>

The algorithm is detailed below for solving <math>\mathbf{A} \mathbf{x}= \mathbf{b}</math>, where <math>\mathbf{A}</math> is a real, symmetric, positive-definite matrix. The input vector <math>\mathbf{x}_0</math> can be an approximate initial solution or <math>\mathbf{0}</math>. It is a different formulation of the exact procedure described above.

:<math>\begin{align}
& \mathbf{r}_0 := \mathbf{b} - \mathbf{A x}_0 \\
& \hbox{if } \mathbf{r}_{0} \text{ is sufficiently small, then return } \mathbf{x}_{0} \text{ as the result}\\
& \mathbf{p}_0 := \mathbf{r}_0 \\
& k := 0 \\
& \text{repeat} \\
& \qquad \alpha_k := \frac{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k} \\
& \qquad \mathbf{x}_{k+1} := \mathbf{x}_k + \alpha_k \mathbf{p}_k \\
& \qquad \mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k \\
& \qquad \hbox{if } \mathbf{r}_{k+1} \text{ is sufficiently small, then exit loop} \\
& \qquad \beta_k := \frac{\mathbf{r}_{k+1}^\mathsf{T} \mathbf{r}_{k+1}}{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k} \\
& \qquad \mathbf{p}_{k+1} := \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k \\
& \qquad k := k + 1 \\
& \text{end repeat} \\
& \text{return } \mathbf{x}_{k+1} \text{ as the result}
\end{align}</math>

This is the most commonly used algorithm. The same formula for <math>\beta_k</math> is also used in the Fletcher–Reeves [[nonlinear conjugate gradient method]].

====Restarts====
We note that <math>\mathbf{x}_{1}</math> is computed by the [[Gradient descent#Solution of a linear system|gradient descent]] method applied to <math>\mathbf{x}_{0}</math>. Setting <math>\beta_{k}=0</math> would similarly make <math>\mathbf{x}_{k+1}</math> computed by the [[Gradient descent#Solution of a linear system|gradient descent]] method from <math>\mathbf{x}_{k}</math>, i.e., it can be used as a simple implementation of a restart of the conjugate gradient iterations.<ref name="BP" /> Restarts could slow down convergence, but may improve stability if the conjugate gradient method misbehaves, e.g., due to [[round-off error]].
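The following Julia sketch is illustrative only; the function name, the restart period <code>restart_every</code>, and the other keyword parameters are hypothetical choices, not part of the cited references. It shows one way a periodic restart could be realized: every <code>restart_every</code> iterations the search direction is reset to the current residual, which is equivalent to taking <math>\beta_k = 0</math> at that step.

<syntaxhighlight lang="julia">
# Illustrative sketch of conjugate gradient with periodic restarts: every
# `restart_every` iterations the search direction is reset to the current
# residual, i.e. β = 0 at that step, giving a plain gradient-descent step.
function restarted_conjugate_gradient(A, b, x0; tol=1e-8, restart_every=50, maxiter=10_000)
    x = copy(x0)
    r = b - A * x
    p = copy(r)
    rs_old = r' * r
    for k in 1:maxiter
        sqrt(rs_old) <= tol && break
        Ap = A * p
        alpha = rs_old / (p' * Ap)
        x .+= alpha .* p
        r .-= alpha .* Ap
        rs_new = r' * r
        if k % restart_every == 0
            p .= r                              # restart: discard the previous direction (β = 0)
        else
            p .= r .+ (rs_new / rs_old) .* p    # standard conjugate gradient update
        end
        rs_old = rs_new
    end
    return x
end
</syntaxhighlight>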
====Explicit residual calculation====
The formulas <math>\mathbf{x}_{k+1} := \mathbf{x}_k + \alpha_k \mathbf{p}_k</math> and <math>\mathbf{r}_k := \mathbf{b} - \mathbf{A x}_k</math>, which both hold in exact arithmetic, make the formulas <math>\mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k</math> and <math>\mathbf{r}_{k+1} := \mathbf{b} - \mathbf{A x}_{k+1}</math> mathematically equivalent. The former is used in the algorithm to avoid an extra multiplication by <math>\mathbf{A}</math>, since the vector <math>\mathbf{A p}_k</math> is already computed to evaluate <math>\alpha_k</math>. The latter may be more accurate, since it substitutes the explicit calculation <math>\mathbf{r}_{k+1} := \mathbf{b} - \mathbf{A x}_{k+1}</math> for the implicit one defined by the recursion, which is subject to [[round-off error]] accumulation; it is thus recommended for an occasional evaluation.<ref>{{cite book |first=Jonathan R |last=Shewchuk |title=An Introduction to the Conjugate Gradient Method Without the Agonizing Pain |year=1994 |url=http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf }}</ref>

A norm of the residual is typically used as the stopping criterion. The norm of the explicit residual <math>\mathbf{r}_{k+1} := \mathbf{b} - \mathbf{A x}_{k+1}</math> provides a guaranteed level of accuracy both in exact arithmetic and in the presence of [[rounding errors]], where convergence naturally stagnates. In contrast, the implicit residual <math>\mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A p}_k</math> is known to keep decreasing in amplitude well below the level of [[rounding errors]] and thus cannot be used to determine the stagnation of convergence.
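The following Julia sketch is illustrative only; the function name and the parameter <code>recompute_every</code> are hypothetical. It shows one way to combine the cheap recursive residual update with an occasional explicit evaluation of <math>\mathbf{b} - \mathbf{A x}_{k+1}</math>, which is then used for the stopping test.

<syntaxhighlight lang="julia">
# Illustrative sketch: the recursion drives the iteration, while an explicitly
# computed residual b - A*x is formed every `recompute_every` steps and used
# for the (relative) stopping test.
function cg_explicit_residual_check(A, b, x0; tol=1e-8, recompute_every=50, maxiter=10_000)
    x = copy(x0)
    r = b - A * x                            # explicit residual at the initial guess
    p = copy(r)
    rs = r' * r
    bnorm = sqrt(b' * b)
    for k in 1:maxiter
        Ap = A * p
        alpha = rs / (p' * Ap)
        x .+= alpha .* p
        r .-= alpha .* Ap                    # implicit residual via the recursion
        if k % recompute_every == 0
            r_explicit = b - A * x           # occasional explicit evaluation
            if sqrt(r_explicit' * r_explicit) <= tol * bnorm
                return x                     # stop based on the explicit residual
            end
        end
        rs_new = r' * r
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end
</syntaxhighlight>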
====Computation of alpha and beta====
In the algorithm, <math>\alpha_k</math> is chosen such that <math>\mathbf{r}_{k+1}</math> is orthogonal to <math>\mathbf{r}_{k}</math>. The denominator is simplified from
:<math>\alpha_k = \frac{\mathbf{r}_{k}^\mathsf{T} \mathbf{r}_{k}}{\mathbf{r}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_k} = \frac{\mathbf{r}_k^\mathsf{T} \mathbf{r}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A p}_k}</math>
since <math>\mathbf{r}_{k+1} = \mathbf{p}_{k+1}-\beta_{k}\mathbf{p}_{k}</math> and the search directions are mutually <math>\mathbf{A}</math>-orthogonal. The coefficient <math>\beta_k</math> is chosen such that <math>\mathbf{p}_{k+1}</math> is conjugate to <math>\mathbf{p}_{k}</math>. Before simplification, <math>\beta_k</math> is
:<math>\beta_k = - \frac{\mathbf{r}_{k+1}^\mathsf{T} \mathbf{A} \mathbf{p}_k}{\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k}.</math>
Using
:<math>\mathbf{r}_{k+1} = \mathbf{r}_{k} - \alpha_{k} \mathbf{A} \mathbf{p}_{k},</math>
or equivalently
:<math>\mathbf{A} \mathbf{p}_{k} = \frac{1}{\alpha_{k}} (\mathbf{r}_{k} - \mathbf{r}_{k+1}),</math>
the numerator of <math>\beta_k</math> is rewritten as
:<math>\mathbf{r}_{k+1}^\mathsf{T} \mathbf{A} \mathbf{p}_k = \frac{1}{\alpha_k} \mathbf{r}_{k+1}^\mathsf{T} (\mathbf{r}_k - \mathbf{r}_{k+1}) = - \frac{1}{\alpha_k} \mathbf{r}_{k+1}^\mathsf{T} \mathbf{r}_{k+1}</math>
because <math>\mathbf{r}_{k+1}</math> and <math>\mathbf{r}_{k}</math> are orthogonal by design. The denominator is rewritten as
:<math>\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k = (\mathbf{r}_k + \beta_{k-1} \mathbf{p}_{k-1})^\mathsf{T} \mathbf{A} \mathbf{p}_k = \frac{1}{\alpha_k} \mathbf{r}_k^\mathsf{T} (\mathbf{r}_k - \mathbf{r}_{k+1}) = \frac{1}{\alpha_k} \mathbf{r}_k^\mathsf{T} \mathbf{r}_k</math>
using that the search directions <math>\mathbf{p}_k</math> are conjugate and again that the residuals are orthogonal. This gives the <math>\beta_k</math> used in the algorithm after cancelling <math>\alpha_k</math>.

====Example code in [[Julia (programming language)|Julia]]====
<syntaxhighlight lang="julia" line="1" start="1">
"""
    conjugate_gradient!(A, b, x)

Return the solution to `A * x = b` using the conjugate gradient method.
"""
function conjugate_gradient!(
    A::AbstractMatrix, b::AbstractVector, x::AbstractVector; tol=eps(eltype(b))
)
    # Initialize residual vector
    residual = b - A * x
    # Initialize search direction vector
    search_direction = copy(residual)
    # Compute initial residual norm
    norm(v) = sqrt(sum(v.^2))
    old_resid_norm = norm(residual)

    # Iterate until convergence
    while old_resid_norm > tol
        A_search_direction = A * search_direction
        step_size = old_resid_norm^2 / (search_direction' * A_search_direction)
        # Update solution
        @. x = x + step_size * search_direction
        # Update residual
        @. residual = residual - step_size * A_search_direction
        new_resid_norm = norm(residual)
        # Update search direction vector
        @. search_direction = residual + (new_resid_norm / old_resid_norm)^2 * search_direction
        # Update residual norm for the next iteration
        old_resid_norm = new_resid_norm
    end

    return x
end
</syntaxhighlight>

====Example code in [[MATLAB]]====
<syntaxhighlight lang="matlab" line="1" start="1">
function x = conjugate_gradient(A, b, x0, tol)
    % Return the solution to `A * x = b` using the conjugate gradient method.
    % Reminder: A should be symmetric and positive definite.
    if nargin < 4
        tol = eps;
    end
    r = b - A * x0;
    p = r;
    rsold = r' * r;
    x = x0;

    while sqrt(rsold) > tol
        Ap = A * p;
        alpha = rsold / (p' * Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        rsnew = r' * r;
        p = r + (rsnew / rsold) * p;
        rsold = rsnew;
    end
end
</syntaxhighlight>
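A minimal usage sketch for the Julia function <code>conjugate_gradient!</code> defined above; the test matrix, tolerance, and random seed are illustrative choices and not part of the original example.

<syntaxhighlight lang="julia">
using LinearAlgebra, Random

Random.seed!(0)
n = 100
M = randn(n, n)
A = M' * M + n * I      # a symmetric positive-definite test matrix
b = randn(n)
x = zeros(n)            # initial guess x0 = 0

conjugate_gradient!(A, b, x; tol = 1e-10 * norm(b))
@show norm(A * x - b)   # residual norm of the computed solution
</syntaxhighlight>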