=== Precursors ===

Backpropagation has been derived repeatedly, as it is essentially an efficient application of the [[chain rule]] (first written down by [[Gottfried Wilhelm Leibniz]] in 1676)<ref name="leibniz1676">{{Cite book |last=Leibniz |first=Gottfried Wilhelm Freiherr von |author-link=Gottfried Wilhelm Leibniz |url=https://books.google.com/books?id=bOIGAAAAYAAJ&q=leibniz+altered+manuscripts&pg=PA90 |title=The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir) |date=1920 |publisher=Open Court Publishing Company |isbn=9780598818461 |language=en}}</ref><ref>{{cite journal |last1=Rodríguez |first1=Omar Hernández |last2=López Fernández |first2=Jorge M. |year=2010 |title=A Semiotic Reflection on the Didactics of the Chain Rule |url=https://scholarworks.umt.edu/tme/vol7/iss2/10/ |journal=The Mathematics Enthusiast |volume=7 |issue=2 |pages=321–332 |doi=10.54870/1551-3440.1191 |s2cid=29739148 |access-date=2019-08-04 |doi-access=free}}</ref> to neural networks. The terminology "back-propagating error correction" was introduced in 1962 by [[Frank Rosenblatt]], but he did not know how to implement it.<ref>{{cite book |last=Rosenblatt |first=Frank |author-link=Frank Rosenblatt |title=Principles of Neurodynamics |publisher=Spartan, New York |year=1962 |pages=287–298}}</ref> In any case, he studied only neurons whose outputs took discrete levels, so their derivatives were zero wherever defined, making backpropagation impossible.
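As a minimal sketch of the chain-rule view (the notation here is chosen for illustration and is not drawn from the cited works): let a scalar loss be computed in two stages, <math>L = \ell(f(g(x; w_1); w_2))</math>, with parameters <math>w_1</math> and <math>w_2</math>. The chain rule gives

<math display="block">\frac{\partial L}{\partial w_2} = \frac{\partial \ell}{\partial f}\,\frac{\partial f}{\partial w_2}, \qquad \frac{\partial L}{\partial w_1} = \frac{\partial \ell}{\partial f}\,\frac{\partial f}{\partial g}\,\frac{\partial g}{\partial w_1}.</math>

Backpropagation evaluates these factors from the loss backward, so the shared prefix <math>\tfrac{\partial \ell}{\partial f}\,\tfrac{\partial f}{\partial g}</math> is computed once and reused for every parameter of the earlier stage; this reuse is what makes the application of the chain rule efficient.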
Precursors to backpropagation appeared in [[Optimal control|optimal control theory]] from the 1950s onward. [[Yann LeCun]] et al. credit 1950s work by [[Lev Pontryagin|Pontryagin]] and others in optimal control theory, especially the [[adjoint state method]], as a continuous-time version of backpropagation.<ref>LeCun, Yann, et al. "A theoretical framework for back-propagation." ''Proceedings of the 1988 Connectionist Models Summer School''. Vol. 1. 1988.</ref> [[Robert Hecht-Nielsen|Hecht-Nielsen]]<ref>{{Cite book |last=Hecht-Nielsen |first=Robert |url=http://archive.org/details/neurocomputing0000hech |title=Neurocomputing |date=1990 |publisher=Reading, Mass. : Addison-Wesley Pub. Co. |others=Internet Archive |isbn=978-0-201-09355-1 |pages=124–125}}</ref> credits the [[Stochastic approximation|Robbins–Monro algorithm]] (1951)<ref name="robbins1951">{{Cite journal |last1=Robbins |first1=H. |author-link=Herbert Robbins |last2=Monro |first2=S. |year=1951 |title=A Stochastic Approximation Method |journal=The Annals of Mathematical Statistics |volume=22 |issue=3 |pages=400 |doi=10.1214/aoms/1177729586 |doi-access=free}}</ref> and [[Arthur E. Bryson|Arthur Bryson]] and [[Yu-Chi Ho]]'s ''Applied Optimal Control'' (1969) as presages of backpropagation. Other precursors were [[Henry J. Kelley]] (1960)<ref name="kelley1960" /> and [[Arthur E. Bryson]] (1961).<ref name="bryson1961" /> In 1962, [[Stuart Dreyfus]] published a simpler derivation based only on the [[chain rule]].<ref>{{Cite journal |last=Dreyfus |first=Stuart |year=1962 |title=The numerical solution of variational problems |journal=Journal of Mathematical Analysis and Applications |volume=5 |issue=1 |pages=30–45 |doi=10.1016/0022-247x(62)90004-5 |doi-access=free}}</ref><ref name="dreyfus1990">{{Cite journal |last=Dreyfus |first=Stuart E. |author-link=Stuart Dreyfus |date=1990 |title=Artificial Neural Networks, Back Propagation, and the Kelley-Bryson Gradient Procedure |journal=Journal of Guidance, Control, and Dynamics |volume=13 |issue=5 |pages=926–928 |bibcode=1990JGCD...13..926D |doi=10.2514/3.25422}}</ref><ref>{{Cite web |last1=Mizutani |first1=Eiji |last2=Dreyfus |first2=Stuart |last3=Nishio |first3=Kenichi |date=July 2000 |title=On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application |url=https://coeieor.wpengine.com/wp-content/uploads/2019/03/ijcnn2k.pdf |publisher=Proceedings of the IEEE International Joint Conference on Neural Networks}}</ref> In 1973, he adapted the [[parameter]]s of controllers in proportion to error gradients.<ref name="dreyfus1973">{{cite journal |last=Dreyfus |first=Stuart |author-link=Stuart Dreyfus |year=1973 |title=The computational solution of optimal control problems with time lag |journal=IEEE Transactions on Automatic Control |volume=18 |issue=4 |pages=383–385 |doi=10.1109/tac.1973.1100330}}</ref> Unlike modern backpropagation, these precursors used standard Jacobian matrix calculations from one stage to the previous one, neither addressing direct links across several stages nor potential additional efficiency gains due to network sparsity.<ref name="DLhistory">{{cite arXiv |eprint=2212.11279 |class=cs.NE |first=Jürgen |last=Schmidhuber |author-link=Jürgen Schmidhuber |title=Annotated History of Modern AI and Deep Learning |date=2022}}</ref>

The [[ADALINE]] (1960) learning algorithm was gradient descent with a squared error loss for a single layer (see the sketch after this paragraph). The first [[multilayer perceptron]] (MLP) in which more than one layer was trained by [[stochastic gradient descent]]<ref name="robbins1951" /> was published in 1967 by [[Shun'ichi Amari]].<ref name="Amari1967">{{cite journal |last1=Amari |first1=Shun'ichi |author-link=Shun'ichi Amari |date=1967 |title=A theory of adaptive pattern classifiers |journal=IEEE Transactions on Electronic Computers |volume=EC-16 |issue=3 |pages=299–307}}</ref> His MLP had five layers, two of them learnable, and it learned to classify patterns that were not linearly separable.<ref name="DLhistory" />
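As a concrete sketch of the ADALINE rule (an illustrative reconstruction assuming a linear unit and squared error, with notation chosen here rather than taken from the 1960 work): for a single linear unit <math>y = w^\mathsf{T} x</math> with target <math>d</math> and error <math>E = \tfrac{1}{2}(d - y)^2</math>, gradient descent with learning rate <math>\eta</math> yields the update

<math display="block">w \leftarrow w + \eta\,(d - w^\mathsf{T} x)\,x,</math>

the Widrow–Hoff least-mean-squares rule. With only a single layer there is no error signal to propagate backward, which is why ADALINE is a precursor to, rather than an instance of, backpropagation.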