== Derivation from Newton's method ==

In what follows, the Gauss–Newton algorithm will be derived from [[Newton's method in optimization|Newton's method]] for function optimization via an approximation. As a consequence, the rate of convergence of the Gauss–Newton algorithm can be quadratic under certain regularity conditions. In general (under weaker conditions), the convergence rate is linear.<ref>{{Cite web |url=http://www.henley.ac.uk/web/FILES/maths/09-04.pdf |title=Archived copy |access-date=2014-04-25 |archive-url=https://web.archive.org/web/20160804022151/http://www.henley.ac.uk/web/FILES/maths/09-04.pdf |archive-date=2016-08-04 |url-status=dead }}</ref>

The recurrence relation for Newton's method for minimizing a function ''S'' of parameters <math>\boldsymbol\beta</math> is
<math display="block"> \boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \mathbf H^{-1} \mathbf g,</math>
where '''g''' denotes the [[gradient|gradient vector]] of ''S'', and '''H''' denotes the [[Hessian matrix]] of ''S''.

Since <math display="inline">S = \sum_{i=1}^m r_i^2</math>, the gradient is given by
<math display="block">g_j = 2 \sum_{i=1}^m r_i \frac{\partial r_i}{\partial \beta_j}.</math>

Elements of the Hessian are calculated by differentiating the gradient elements, <math>g_j</math>, with respect to <math>\beta_k</math>:
<math display="block">H_{jk} = 2 \sum_{i=1}^m \left(\frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k} + r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k}\right).</math>

The Gauss–Newton method is obtained by ignoring the second-order derivative terms (the second term in this expression). That is, the Hessian is approximated by
<math display="block">H_{jk} \approx 2 \sum_{i=1}^m J_{ij} J_{ik},</math>
where <math display="inline">J_{ij} = {\partial r_i}/{\partial \beta_j}</math> are entries of the Jacobian '''J<sub>r</sub>'''. Note that when the exact Hessian is evaluated near an exact fit, the residuals <math>r_i</math> are near zero, so the second term becomes near zero as well, which justifies the approximation.

The gradient and the approximate Hessian can be written in matrix notation as
<math display="block">\mathbf{g} = 2 {\mathbf{J}_\mathbf{r}}^\operatorname{T} \mathbf{r}, \quad \mathbf{H} \approx 2 {\mathbf{J}_\mathbf{r}}^\operatorname{T} \mathbf{J_r}.</math>

These expressions are substituted into the recurrence relation above to obtain the operational equations
<math display="block"> \boldsymbol{\beta}^{(s+1)} = \boldsymbol\beta^{(s)} + \Delta; \quad \Delta = -\left(\mathbf{J_r}^\operatorname{T} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^\operatorname{T} \mathbf{r}.</math>

Convergence of the Gauss–Newton method is not guaranteed in all instances. The approximation
<math display="block">\left|r_i \frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k}\right| \ll \left|\frac{\partial r_i}{\partial \beta_j} \frac{\partial r_i}{\partial \beta_k}\right|</math>
which must hold for the second-order derivative terms to be negligible, may be valid in two cases, for which convergence is to be expected:<ref>Nocedal (1999), p. 259.</ref>
# The function values <math>r_i</math> are small in magnitude, at least around the minimum.
# The functions are only "mildly" nonlinear, so that <math display="inline">\frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k}</math> is relatively small in magnitude.
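For illustration, the operational equations above can be translated into a short numerical routine. The following Python sketch is illustrative only: the function names, the stopping tolerance, and the use of a least-squares solve in place of the explicit inverse <math>\left(\mathbf{J_r}^\operatorname{T} \mathbf{J_r}\right)^{-1}</math> are implementation choices, not part of the algorithm's definition.

<syntaxhighlight lang="python">
import numpy as np

def gauss_newton(r, J, beta0, max_iter=50, tol=1e-10):
    """Minimize S(beta) = sum_i r_i(beta)^2 with Gauss-Newton updates.

    r(beta)  -> residual vector of shape (m,)
    J(beta)  -> Jacobian of shape (m, n), with J[i, j] = d r_i / d beta_j
    beta0    -> initial parameter guess of shape (n,)
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        res = r(beta)
        jac = J(beta)
        # The Gauss-Newton increment solves the normal equations
        #   (J^T J) delta = -J^T r,
        # which is done here as the linear least-squares problem
        #   min_delta || jac @ delta + res ||.
        delta, *_ = np.linalg.lstsq(jac, -res, rcond=None)
        beta = beta + delta
        if np.linalg.norm(delta) < tol:
            break
    return beta
</syntaxhighlight>

Solving the increment as a linear least-squares problem is algebraically equivalent to applying <math>\Delta = -\left(\mathbf{J_r}^\operatorname{T} \mathbf{J_r}\right)^{-1} \mathbf{J_r}^\operatorname{T} \mathbf{r}</math>, but avoids forming the inverse explicitly, which is generally better conditioned numerically.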