== Improved versions ==
With the Gauss–Newton method the sum of squares of the residuals ''S'' may not decrease at every iteration. However, since Δ is a descent direction, unless <math>S\left(\boldsymbol \beta^s\right)</math> is a stationary point, it holds that <math>S\left(\boldsymbol \beta^s + \alpha\Delta\right) < S\left(\boldsymbol \beta^s\right)</math> for all sufficiently small <math>\alpha>0</math>. Thus, if divergence occurs, one solution is to employ a fraction <math>\alpha</math> of the increment vector Δ in the updating formula:
<math display="block"> \boldsymbol \beta^{s+1} = \boldsymbol \beta^s + \alpha \Delta.</math>
In other words, the increment vector is too long, but it still points "downhill", so going just a part of the way will decrease the objective function ''S''. An optimal value for <math>\alpha</math> can be found by using a [[line search]] algorithm, that is, the magnitude of <math>\alpha</math> is determined by finding the value that minimizes ''S'', usually using a [[line search|direct search method]] in the interval <math>0 < \alpha < 1</math> or a [[backtracking line search]] such as Armijo line search. Typically, <math>\alpha</math> should be chosen such that it satisfies the [[Wolfe conditions]] or the [[Goldstein conditions]].<ref>{{Cite book |title=Numerical Optimization |last1=Nocedal |first1=Jorge |last2=Wright |first2=Stephen J. |date=1999 |publisher=Springer |location=New York |isbn=0387227423 |oclc=54849297}}</ref>

In cases where the direction of the shift vector is such that the optimal fraction α is close to zero, an alternative method for handling divergence is the use of the [[Levenberg–Marquardt algorithm]], a [[trust region]] method.<ref name="ab"/> The normal equations are modified in such a way that the increment vector is rotated towards the direction of [[steepest descent]],
<math display="block">\left(\mathbf{J^\operatorname{T} J + \lambda D}\right) \Delta = -\mathbf{J}^\operatorname{T} \mathbf{r},</math>
where '''D''' is a positive diagonal matrix. Note that when '''D''' is the identity matrix '''I''' and <math>\lambda \to +\infty</math>, then <math>\lambda \Delta = \lambda \left(\mathbf{J^\operatorname{T} J} + \lambda \mathbf{I}\right)^{-1} \left(-\mathbf{J}^\operatorname{T} \mathbf{r}\right) = \left(\mathbf{I} - \mathbf{J^\operatorname{T} J} / \lambda + \cdots \right) \left(-\mathbf{J}^\operatorname{T} \mathbf{r}\right) \to -\mathbf{J}^\operatorname{T} \mathbf{r}</math>; therefore the [[Direction (geometry, geography)|direction]] of Δ approaches the direction of the negative gradient <math>-\mathbf{J}^\operatorname{T} \mathbf{r}</math>.

The so-called Marquardt parameter <math>\lambda</math> may also be optimized by a line search, but this is inefficient, as the shift vector must be recalculated every time <math>\lambda</math> is changed. A more efficient strategy is this: when divergence occurs, increase the Marquardt parameter until there is a decrease in ''S''. Then retain the value from one iteration to the next, but decrease it if possible until a cut-off value is reached, at which point the Marquardt parameter can be set to zero; the minimization of ''S'' then becomes a standard Gauss–Newton minimization.
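For concreteness, the following is a minimal NumPy sketch of the two devices described above: a damped Gauss–Newton update whose fraction <math>\alpha</math> is chosen by an Armijo backtracking line search, and the Levenberg–Marquardt modification of the normal equations. The function names, Armijo constants, tolerances, and the strategy for varying the Marquardt parameter are illustrative assumptions, not a standard or reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def gauss_newton_damped(r, J, beta0, c=1e-4, tau=0.5, tol=1e-8, max_iter=100):
    """Damped Gauss-Newton: beta <- beta + alpha*Delta, with alpha found by
    an Armijo (backtracking) line search so that S = sum of squared residuals
    decreases.  r(beta) returns the residual vector (m,), J(beta) its Jacobian (m, n).
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        res = r(beta)
        jac = J(beta)
        grad = jac.T @ res                           # (1/2) * gradient of S
        delta = np.linalg.solve(jac.T @ jac, -grad)  # Gauss-Newton increment
        S = res @ res
        alpha = 1.0
        while True:
            res_trial = r(beta + alpha * delta)
            # Armijo sufficient decrease; 2*(grad @ delta) is the (negative)
            # directional derivative of S along delta.
            if res_trial @ res_trial <= S + c * alpha * 2.0 * (grad @ delta):
                break
            alpha *= tau
            if alpha < 1e-12:        # stop shrinking; accept the tiny step
                break
        step = alpha * delta
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def levenberg_marquardt_increment(jac, res, lam, D=None):
    """Solve the modified normal equations (J^T J + lam*D) Delta = -J^T r.
    With D = I, Delta turns toward the steepest-descent direction as lam grows."""
    if D is None:
        D = np.eye(jac.shape[1])                     # D = I, the classic choice
    return np.linalg.solve(jac.T @ jac + lam * D, -jac.T @ res)
</syntaxhighlight>

In a Levenberg–Marquardt iteration built on the second helper, the parameter <code>lam</code> would be increased after a step that fails to reduce ''S'' and decreased after a successful one, following the strategy described in the last paragraph above.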