==Second-order gradient descent==
{{Anchor|Second order|Hessian}}
Using a [[Hessian matrix]] of second-order derivatives of the error function, the [[Levenberg–Marquardt algorithm]] often converges faster than first-order gradient descent, especially when the topology of the error function is complicated.<ref name="Tan2018">{{cite journal|last1=Tan|first1=Hong Hui|last2=Lim|first2=King Han|title=Review of second-order optimization techniques in artificial neural networks backpropagation|journal=IOP Conference Series: Materials Science and Engineering|year=2019|volume=495|issue=1|page=012003|doi=10.1088/1757-899X/495/1/012003|bibcode=2019MS&E..495a2003T|s2cid=208124487|doi-access=free}}</ref><ref name="Wiliamowski2010">{{cite journal|title=Improved Computation for Levenberg–Marquardt Training|last1=Wiliamowski|first1=Bogdan|last2=Yu|first2=Hao|journal=IEEE Transactions on Neural Networks and Learning Systems|volume=21|issue=6|date=June 2010|url=https://www.eng.auburn.edu/~wilambm/pap/2010/Improved%20Computation%20for%20LM%20Training.pdf}}</ref> It may also find solutions with smaller networks (fewer nodes) for which other methods might not converge.<ref name="Wiliamowski2010" /> The Hessian can be approximated by the [[Fisher information]] matrix.<ref name="Martens2020">{{cite journal|last=Martens|first=James|title=New Insights and Perspectives on the Natural Gradient Method|journal=Journal of Machine Learning Research|issue=21|date=August 2020|arxiv=1412.1193}}</ref>

As an example, consider a simple feedforward network. At the <math>l</math>-th layer, we have<math display="block">x^{(l)}_i, \quad a^{(l)}_i = f(x^{(l)}_i), \quad x^{(l+1)}_i = \sum_j W_{ij} a^{(l)}_j</math>where <math>x</math> are the pre-activations, <math>a</math> are the activations, and <math>W</math> is the weight matrix. Given a loss function <math>L</math>, first-order backpropagation states that<math display="block">\frac{\partial L}{\partial a_j^{(l)}} = \sum_i W_{ij}\frac{\partial L}{\partial x_i^{(l+1)}}, \quad \frac{\partial L}{\partial x_j^{(l)}} = f'(x_j^{(l)})\frac{\partial L}{\partial a_j^{(l)}}</math>and second-order backpropagation states that<math display="block">\frac{\partial^2 L}{\partial a_{j_1}^{(l)}\partial a_{j_2}^{(l)}} = \sum_{i_1 i_2} W_{i_1j_1}W_{i_2j_2}\frac{\partial^2 L}{\partial x_{i_1}^{(l+1)}\partial x_{i_2}^{(l+1)}}, \quad \frac{\partial^2 L}{\partial x_{j_1}^{(l)}\partial x_{j_2}^{(l)}} = f'(x_{j_1}^{(l)}) f'(x_{j_2}^{(l)}) \frac{\partial^2 L}{\partial a_{j_1}^{(l)}\partial a_{j_2}^{(l)}} + \delta_{j_1 j_2} f''(x^{(l)}_{j_1} ) \frac{\partial L}{\partial a_{j_1}^{(l)}}</math>where <math>\delta</math> is the [[Kronecker delta]]. Arbitrary-order derivatives in arbitrary computational graphs can be computed with backpropagation, but with more complex expressions for higher orders.
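The recursion above can be written directly as code. The following is a minimal illustrative sketch in Python/NumPy, not taken from any of the cited works: it assumes a single dense layer with an elementwise <math>\tanh</math> activation, and all function names are chosen for the example. It propagates the gradient vector and the Hessian matrix of the loss backward through one layer, matching the two displayed equations.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sketch only: propagate first- and second-order derivatives of the
# loss backward through one dense layer x^{l+1} = W a^{l}, with a^{l} = f(x^{l}).
# The tanh activation and all names here are assumptions made for the example.

def f_prime(x):         # f'(x) for f = tanh
    return 1.0 - np.tanh(x) ** 2

def f_double_prime(x):  # f''(x) for f = tanh
    return -2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)

def second_order_backprop_layer(W, x_l, dL_dx_next, d2L_dx_next):
    """Given dL/dx^{l+1} (vector) and d^2L/(dx^{l+1} dx^{l+1}) (matrix),
    return dL/dx^{l} and d^2L/(dx^{l} dx^{l}) for this layer."""
    # First order: dL/da_j = sum_i W_ij dL/dx_i^{l+1};  dL/dx_j = f'(x_j) dL/da_j
    dL_da = W.T @ dL_dx_next
    dL_dx = f_prime(x_l) * dL_da

    # Second order: d^2L/(da_{j1} da_{j2}) = sum_{i1,i2} W_{i1 j1} W_{i2 j2} d^2L/(dx_{i1} dx_{i2})
    d2L_da = W.T @ d2L_dx_next @ W
    # d^2L/(dx_{j1} dx_{j2}) = f'(x_{j1}) f'(x_{j2}) d^2L/(da_{j1} da_{j2})
    #                          + delta_{j1 j2} f''(x_{j1}) dL/da_{j1}
    fp = f_prime(x_l)
    d2L_dx = np.outer(fp, fp) * d2L_da + np.diag(f_double_prime(x_l) * dL_da)
    return dL_dx, d2L_dx
</syntaxhighlight>

Repeating this step layer by layer yields the Hessian of the loss with respect to the pre-activations of any earlier layer. In practice the full Hessian is rarely formed exactly; second-order methods such as Levenberg–Marquardt or natural-gradient descent work with approximations such as the Gauss–Newton or Fisher information matrix, as noted above.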