
=== Choice of damping parameter ===
Various more or less heuristic arguments have been put forward for the best choice of the damping parameter {{tmath|\lambda}}.  Theoretical arguments show why some of these choices guarantee local convergence of the algorithm; however, the same choices can cause the global convergence of the algorithm to suffer from the undesirable properties of [[gradient descent|steepest descent]], in particular very slow convergence close to the optimum.

The absolute value of any choice depends on how well-scaled the initial problem is.  Marquardt recommended starting with a value {{tmath|\lambda_0}} and a factor {{tmath|\nu > 1}}. Initially, one sets <math>\lambda = \lambda_0</math> and computes the residual sum of squares <math>S\left (\boldsymbol\beta\right )</math> after one step from the starting point, first with the damping factor <math>\lambda = \lambda_0</math> and then with {{tmath|\lambda_0 / \nu}}.  If both of these are worse than the initial point, then the damping is increased by successive multiplication by {{tmath|\nu}} until a better point is found with a new damping factor of {{tmath|\lambda_0\nu^k}} for some {{tmath|k}}.

If use of the damping factor {{tmath|\lambda / \nu}} results in a reduction in the squared residual, then this is taken as the new value of {{tmath|\lambda}} (and the new optimum location is taken as that obtained with this damping factor) and the process continues. If using {{tmath|\lambda / \nu}} results in a worse residual, but using {{tmath|\lambda}} results in a better residual, then {{tmath|\lambda}} is left unchanged and the new optimum is taken as the value obtained with {{tmath|\lambda}} as damping factor.
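
Marquardt's rule can be sketched as a small routine that picks the damping for one iteration. This is an illustrative sketch rather than a reference implementation: the function names, the default <code>nu=10.0</code>, the cap on the number of increases, and the one-parameter toy problem (residual <math>r(\beta) = \arctan\beta</math> with an identity-damped step) are all assumptions made for the example.

```python
import math


def marquardt_damping(cost_at, s_current, lam, nu=10.0, max_increases=60):
    """Choose the damping for one iteration, following Marquardt's rule.

    cost_at(d) is a hypothetical callable returning the residual sum of
    squares after one trial step taken with damping d.
    Returns (accepted damping, cost at the accepted trial point).
    """
    # First try the reduced damping lam / nu.
    s_reduced = cost_at(lam / nu)
    if s_reduced < s_current:
        return lam / nu, s_reduced

    # Otherwise try the unchanged damping lam.
    s_same = cost_at(lam)
    if s_same < s_current:
        return lam, s_same

    # Both were worse: multiply by nu repeatedly until the cost improves.
    for k in range(1, max_increases + 1):
        trial = lam * nu ** k
        s_trial = cost_at(trial)
        if s_trial < s_current:
            return trial, s_trial
    raise RuntimeError("no improving step found")


def toy_problem(beta):
    """Toy setup (assumption): one parameter, residual r = atan(beta)."""
    r = math.atan(beta)
    jac = 1.0 / (1.0 + beta * beta)  # dr/dbeta

    def cost_at(damping):
        # Scalar damped step: -(J r) / (J^2 + damping).
        step = -jac * r / (jac * jac + damping)
        return math.atan(beta + step) ** 2

    return r * r, cost_at


s0, cost_at = toy_problem(3.0)
lam, s1 = marquardt_damping(cost_at, s0, lam=1e-3)
print(lam, s0, s1)
```

Starting far from the optimum, the lightly damped steps overshoot in this toy problem, so the routine has to raise the damping before it finds a point with a smaller residual. The sketch evaluates the step with {{tmath|\lambda}} only when the step with {{tmath|\lambda / \nu}} fails, which saves one evaluation but is otherwise equivalent to computing both up front.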

An effective strategy for the control of the damping parameter, called ''delayed gratification'', consists of increasing the parameter by a small amount for each uphill step, and decreasing it by a large amount for each downhill step. The idea behind this strategy is to avoid moving downhill too fast at the beginning of the optimization, since doing so restricts the steps available in future iterations and thereby slows down convergence.<ref name="Transtrum2011"/> An increase by a factor of 2 and a decrease by a factor of 3 has been shown to be effective in most cases, while for large problems more extreme values can work better, with an increase by a factor of 1.5 and a decrease by a factor of 5.<ref name="Transtrum2012"/>
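
The delayed gratification update can be illustrated with a minimal one-parameter fit. The sketch below is only an example under stated assumptions: the model <math>y = e^{\beta x}</math>, the identity-damped step, the function names, and the fixed iteration count are all hypothetical choices made for illustration; only the factors 2 and 3 come from the text above.

```python
import math


def update_damping(lam, downhill, up=2.0, down=3.0):
    """Delayed gratification: small increase when uphill, large decrease when downhill."""
    return lam / down if downhill else lam * up


def fit_rate(xs, ys, beta, lam=1.0, iterations=50):
    """Fit y = exp(beta * x) for a single parameter beta (toy model, assumption)."""

    def cost(b):
        return sum((y - math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

    s = cost(beta)
    for _ in range(iterations):
        # Residuals r_i = y_i - exp(beta x_i) and their derivatives w.r.t. beta.
        res = [y - math.exp(beta * x) for x, y in zip(xs, ys)]
        jac = [-x * math.exp(beta * x) for x in xs]
        grad = sum(j * r for j, r in zip(jac, res))  # J^T r
        curv = sum(j * j for j in jac)               # J^T J
        trial = beta - grad / (curv + lam)           # identity-damped step
        s_trial = cost(trial)
        downhill = s_trial < s
        if downhill:
            # Accept the step only when it lowers the cost.
            beta, s = trial, s_trial
        lam = update_damping(lam, downhill)
    return beta, s, lam


xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]  # noise-free data generated with beta = 0.5
beta_hat, s_final, lam_final = fit_rate(xs, ys, beta=0.0)
print(beta_hat, s_final)
```

The first few trial steps from <code>beta=0.0</code> are uphill and are rejected, so the damping is doubled each time until a step is accepted; after that the damping is divided by 3 on each downhill step. For the large-problem variant mentioned above, one would pass <code>up=1.5</code> and <code>down=5.0</code> to <code>update_damping</code>.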