==Second-order gradient descent==
{{Anchor|Second order|Hessian}}
Using a [[Hessian matrix]] of second-order derivatives of the error function, the [[Levenberg–Marquardt algorithm]] often converges faster than first-order gradient descent, especially when the topology of the error function is complicated.<ref name="Tan2018">{{cite journal|last1=Tan|first1=Hong Hui|last2=Lim|first2=King Han|title=Review of second-order optimization techniques in artificial neural networks backpropagation|journal=IOP Conference Series: Materials Science and Engineering|year=2019|volume=495|issue=1|page=012003|doi=10.1088/1757-899X/495/1/012003|bibcode=2019MS&E..495a2003T|s2cid=208124487|doi-access=free}}</ref><ref name="Wiliamowski2010">{{cite journal|title=Improved Computation for Levenberg–Marquardt Training|last1=Wiliamowski|first1=Bogdan|last2=Yu|first2=Hao|journal=IEEE Transactions on Neural Networks and Learning Systems|volume=21|issue=6|date=June 2010|url=https://www.eng.auburn.edu/~wilambm/pap/2010/Improved%20Computation%20for%20LM%20Training.pdf}}</ref> It may also find solutions with smaller networks (fewer nodes) for which other methods might not converge.<ref name="Wiliamowski2010" /> The Hessian can be approximated by the [[Fisher information]] matrix.<ref name="Martens2020">{{cite journal|last=Martens|first=James|title=New Insights and Perspectives on the Natural Gradient Method|journal=Journal of Machine Learning Research|issue=21|date=August 2020|arxiv=1412.1193}}</ref>

As an example, consider a simple feedforward network. At the <math>l</math>-th layer, we have<math display="block">x^{(l)}_i, \quad a^{(l)}_i = f(x^{(l)}_i), \quad x^{(l+1)}_i = \sum_j W_{ij} a^{(l)}_j</math>where <math>x</math> are the pre-activations, <math>a</math> are the activations, and <math>W</math> is the weight matrix. Given a loss function <math>L</math>, first-order backpropagation states that<math display="block">\frac{\partial L}{\partial a_j^{(l)}} = \sum_i W_{ij}\frac{\partial L}{\partial x_i^{(l+1)}}, \quad \frac{\partial L}{\partial x_j^{(l)}} = f'(x_j^{(l)})\frac{\partial L}{\partial a_j^{(l)}}</math>and second-order backpropagation states that<math display="block">\frac{\partial^2 L}{\partial a_{j_1}^{(l)}\partial a_{j_2}^{(l)}} = \sum_{i_1 i_2} W_{i_1j_1}W_{i_2j_2}\frac{\partial^2 L}{\partial x_{i_1}^{(l+1)}\partial x_{i_2}^{(l+1)}}, \quad \frac{\partial^2 L}{\partial x_{j_1}^{(l)}\partial x_{j_2}^{(l)}} = f'(x_{j_1}^{(l)}) f'(x_{j_2}^{(l)}) \frac{\partial^2 L}{\partial a_{j_1}^{(l)}\partial a_{j_2}^{(l)}} + \delta_{j_1 j_2} f''(x^{(l)}_{j_1} ) \frac{\partial L}{\partial a_{j_1}^{(l)}}</math>where <math>\delta</math> is the [[Kronecker delta]]. Arbitrary-order derivatives in arbitrary computational graphs can be computed with backpropagation, but with more complex expressions for higher orders.
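The recursion above can be written directly as code. The following is a minimal illustrative sketch in Python/NumPy, not taken from any of the cited works: it assumes a single dense layer with an elementwise <math>\tanh</math> activation, and all function names are chosen for the example. It propagates the gradient vector and the Hessian matrix of the loss backward through one layer, matching the two displayed equations.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sketch only: propagate first- and second-order derivatives of the
# loss backward through one dense layer x^{l+1} = W a^{l}, with a^{l} = f(x^{l}).
# The tanh activation and all names here are assumptions made for the example.

def f_prime(x):         # f'(x) for f = tanh
    return 1.0 - np.tanh(x) ** 2

def f_double_prime(x):  # f''(x) for f = tanh
    return -2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)

def second_order_backprop_layer(W, x_l, dL_dx_next, d2L_dx_next):
    """Given dL/dx^{l+1} (vector) and d^2L/(dx^{l+1} dx^{l+1}) (matrix),
    return dL/dx^{l} and d^2L/(dx^{l} dx^{l}) for this layer."""
    # First order: dL/da_j = sum_i W_ij dL/dx_i^{l+1};  dL/dx_j = f'(x_j) dL/da_j
    dL_da = W.T @ dL_dx_next
    dL_dx = f_prime(x_l) * dL_da

    # Second order: d^2L/(da_{j1} da_{j2}) = sum_{i1,i2} W_{i1 j1} W_{i2 j2} d^2L/(dx_{i1} dx_{i2})
    d2L_da = W.T @ d2L_dx_next @ W
    # d^2L/(dx_{j1} dx_{j2}) = f'(x_{j1}) f'(x_{j2}) d^2L/(da_{j1} da_{j2})
    #                          + delta_{j1 j2} f''(x_{j1}) dL/da_{j1}
    fp = f_prime(x_l)
    d2L_dx = np.outer(fp, fp) * d2L_da + np.diag(f_double_prime(x_l) * dL_da)
    return dL_dx, d2L_dx
</syntaxhighlight>

Repeating this step layer by layer yields the Hessian of the loss with respect to the pre-activations of any earlier layer. In practice the full Hessian is rarely formed exactly; second-order methods such as Levenberg–Marquardt or natural-gradient descent work with approximations such as the Gauss–Newton or Fisher information matrix, as noted above.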