==History==
{{See also|Perceptron#History|label 1=History of Perceptron}}

=== Precursors ===
Backpropagation had been derived repeatedly, as it is essentially an efficient application of the [[chain rule]] (first written down by [[Gottfried Wilhelm Leibniz]] in 1676)<ref name="leibniz1676">{{Cite book |last=Leibniz |first=Gottfried Wilhelm Freiherr von |authorlink=Gottfried Wilhelm Leibniz|url=https://books.google.com/books?id=bOIGAAAAYAAJ&q=leibniz+altered+manuscripts&pg=PA90 |title=The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir) |date=1920 |publisher=Open court publishing Company |isbn=9780598818461 |language=en}}</ref><ref>{{cite journal |last1=Rodríguez |first1=Omar Hernández |last2=López Fernández |first2=Jorge M. |year=2010 |title=A Semiotic Reflection on the Didactics of the Chain Rule |url=https://scholarworks.umt.edu/tme/vol7/iss2/10/ |journal=The Mathematics Enthusiast |volume=7 |issue=2 |pages=321–332 |doi=10.54870/1551-3440.1191 |s2cid=29739148 |access-date=2019-08-04 |doi-access=free}}</ref> to neural networks. The terminology "back-propagating error correction" was introduced in 1962 by [[Frank Rosenblatt]], but he did not know how to implement it.<ref>{{cite book |last=Rosenblatt |first=Frank |title=Principles of Neurodynamics |publisher=Spartan, New York |year=1962 |pages=287–298 |author-link=Frank Rosenblatt}}</ref> In any case, he studied only neurons whose outputs were discrete levels and thus had zero derivatives, making backpropagation impossible.

Precursors to backpropagation appeared in [[Optimal control|optimal control theory]] in the 1950s. [[Yann LeCun]] et al. credit 1950s work by [[Lev Pontryagin|Pontryagin]] and others in optimal control theory, especially the [[adjoint state method]], as a continuous-time version of backpropagation.<ref>LeCun, Yann, et al. "A theoretical framework for back-propagation." ''Proceedings of the 1988 connectionist models summer school''. Vol. 1. 1988.</ref> [[Robert Hecht-Nielsen|Hecht-Nielsen]]<ref>{{Cite book |last=Hecht-Nielsen |first=Robert |url=http://archive.org/details/neurocomputing0000hech |title=Neurocomputing |date=1990 |publisher=Reading, Mass. : Addison-Wesley Pub. Co. |others=Internet Archive |isbn=978-0-201-09355-1 |pages=124–125}}</ref> credits the [[Stochastic approximation|Robbins–Monro algorithm]] (1951)<ref name="robbins1951">{{Cite journal |last1=Robbins |first1=H. |author-link=Herbert Robbins |last2=Monro |first2=S. |year=1951 |title=A Stochastic Approximation Method |journal=The Annals of Mathematical Statistics |volume=22 |issue=3 |pages=400 |doi=10.1214/aoms/1177729586 |doi-access=free}}</ref> and [[Arthur E. Bryson|Arthur Bryson]] and [[Yu-Chi Ho]]'s ''Applied Optimal Control'' (1969) as presages of backpropagation. Other precursors were [[Henry J. Kelley]] (1960)<ref name="kelley1960" /> and [[Arthur E. Bryson]] (1961).<ref name="bryson1961" /> In 1962, [[Stuart Dreyfus]] published a simpler derivation based only on the [[chain rule]].<ref>{{Cite journal |last=Dreyfus |first=Stuart |year=1962 |title=The numerical solution of variational problems |journal=Journal of Mathematical Analysis and Applications |volume=5 |issue=1 |pages=30–45 |doi=10.1016/0022-247x(62)90004-5 |doi-access=free}}</ref><ref name="dreyfus1990">{{Cite journal |last=Dreyfus |first=Stuart E.
|author-link=Stuart Dreyfus |date=1990 |title=Artificial Neural Networks, Back Propagation, and the Kelley-Bryson Gradient Procedure |journal=Journal of Guidance, Control, and Dynamics |volume=13 |issue=5 |pages=926–928 |bibcode=1990JGCD...13..926D |doi=10.2514/3.25422}}</ref><ref>{{Cite web |last1=Mizutani |first1=Eiji |last2=Dreyfus |first2=Stuart |last3=Nishio |first3=Kenichi |date=July 2000 |title=On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application |url=https://coeieor.wpengine.com/wp-content/uploads/2019/03/ijcnn2k.pdf |publisher=Proceedings of the IEEE International Joint Conference on Neural Networks}}</ref> In 1973, he adapted [[parameter]]s of controllers in proportion to error gradients.<ref name="dreyfus1973">{{cite journal |last=Dreyfus |first=Stuart |author-link=Stuart Dreyfus |year=1973 |title=The computational solution of optimal control problems with time lag |journal=IEEE Transactions on Automatic Control |volume=18 |issue=4 |pages=383–385 |doi=10.1109/tac.1973.1100330}}</ref> Unlike modern backpropagation, these precursors used standard Jacobian matrix calculations from one stage to the previous one, neither addressing direct links across several stages nor potential additional efficiency gains due to network sparsity.<ref name="DLhistory">{{cite arXiv |eprint=2212.11279 |class=cs.NE |first=Jürgen |last=Schmidhuber |author-link=Jürgen Schmidhuber |title=Annotated History of Modern AI and Deep Learning |date=2022}}</ref>

The [[ADALINE]] (1960) learning algorithm was gradient descent with a squared error loss for a single layer. The first [[multilayer perceptron]] (MLP) with more than one layer trained by [[stochastic gradient descent]]<ref name="robbins1951" /> was published in 1967 by [[Shun'ichi Amari]].<ref name="Amari1967">{{cite journal |last1=Amari |first1=Shun'ichi |author-link=Shun'ichi Amari |date=1967 |title=A theory of adaptive pattern classifier |journal=IEEE Transactions |volume=EC |issue=16 |pages=279–307}}</ref> The MLP had 5 layers, with 2 learnable layers, and it learned to classify patterns that were not linearly separable.<ref name="DLhistory" />

=== Modern backpropagation ===
Modern backpropagation was first published by [[Seppo Linnainmaa]] as "reverse mode of [[automatic differentiation]]" (1970)<ref name="lin1970">{{cite thesis |first=Seppo |last=Linnainmaa |author-link=Seppo Linnainmaa |year=1970 |type=Masters |title=The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors |language=fi |publisher=University of Helsinki |pages=6–7}}</ref> for discrete connected networks of nested [[Differentiable function|differentiable]] functions.<ref name="lin1976">{{cite journal |last1=Linnainmaa |first1=Seppo |author-link=Seppo Linnainmaa |year=1976 |title=Taylor expansion of the accumulated rounding error |journal=BIT Numerical Mathematics |volume=16 |issue=2 |pages=146–160 |doi=10.1007/bf01931367 |s2cid=122357351}}</ref><ref name="grie2012">{{cite book |last=Griewank |first=Andreas |title=Optimization Stories |year=2012 |series=Documenta Mathematica, Extra Volume ISMP |pages=389–400 |chapter=Who Invented the Reverse Mode of Differentiation?
|s2cid=15568746}}</ref><ref name="grie2008">{{cite book |last1=Griewank |first1=Andreas |url={{google books |plainurl=y |id=xoiiLaRxcbEC}} |title=Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition |last2=Walther |first2=Andrea |author2-link=Andrea Walther |publisher=SIAM |year=2008 |isbn=978-0-89871-776-1}}</ref>

In 1982, [[Paul Werbos]] applied backpropagation to MLPs in the way that has become standard.<ref name="werbos1982">{{Cite book|title=System modeling and optimization|last=Werbos|first=Paul|publisher=Springer|year=1982|pages=762–770|chapter=Applications of advances in nonlinear sensitivity analysis|author-link=Paul Werbos|chapter-url=http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf|access-date=2 July 2017|archive-date=14 April 2016|archive-url=https://web.archive.org/web/20160414055503/http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf|url-status=live}}</ref><ref name="werbos1974">{{cite book |last=Werbos |first=Paul J. |title=The Roots of Backpropagation : From Ordered Derivatives to Neural Networks and Political Forecasting |location=New York |publisher=John Wiley & Sons |year=1994 |isbn=0-471-59897-6 }}</ref> Werbos described in an interview how he developed backpropagation: in 1971, during his PhD work, he developed it to mathematicize [[Sigmund Freud|Freud]]'s "flow of psychic energy". He faced repeated difficulty in publishing the work, only managing to do so in 1981.<ref name=":1">{{Cite book |url=https://direct.mit.edu/books/book/4886/Talking-NetsAn-Oral-History-of-Neural-Networks |title=Talking Nets: An Oral History of Neural Networks |date=2000 |publisher=The MIT Press |isbn=978-0-262-26715-1 |editor-last=Anderson |editor-first=James A. |language=en |doi=10.7551/mitpress/6626.003.0016 |editor-last2=Rosenfeld |editor-first2=Edward}}</ref> He also claimed that "the first practical application of back-propagation was for estimating a dynamic model to predict nationalism and social communications in 1974", referring to his own work.<ref>P. J. Werbos, "Backpropagation through time: what it does and how to do it," in Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, Oct. 1990, {{doi|10.1109/5.58337}}</ref>

Around 1982,<ref name=":1" />{{rp|376}} [[David E. Rumelhart]] independently developed<ref>Olazaran Rodriguez, Jose Miguel. ''[https://web.archive.org/web/20221111165150/https://era.ed.ac.uk/bitstream/handle/1842/20075/Olazaran-RodriguezJM_1991redux.pdf?sequence=1&isAllowed=y A historical sociology of neural network research]''. PhD Dissertation. University of Edinburgh, 1991.</ref>{{rp|252}} backpropagation and taught the algorithm to others in his research circle. He did not cite previous work, as he was unaware of it.
He first published the algorithm in a 1985 paper, and then presented an experimental analysis of the technique in a 1986 ''[[Nature (journal)|Nature]]'' paper.<ref name="learning-representations">{{cite journal | last1 = Rumelhart | last2 = Hinton | last3 = Williams | title=Learning representations by back-propagating errors | journal = Nature | volume = 323 | issue = 6088 | pages = 533–536 | url = http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf| doi = 10.1038/323533a0 | year = 1986 | bibcode = 1986Natur.323..533R | s2cid = 205001834 }}</ref> These papers became highly cited, contributed to the popularization of backpropagation, and coincided with the resurging research interest in neural networks during the 1980s.<ref name="RumelhartHintonWilliams1986a" /><ref name="RumelhartHintonWilliams1986b">{{cite book |editor1-last=Rumelhart |editor1-first=David E. |editor1-link=David E. Rumelhart |editor2-first=James L. |editor2-last=McClelland |editor2-link=James McClelland (psychologist) |title=Parallel Distributed Processing : Explorations in the Microstructure of Cognition |volume=1 : Foundations |last1=Rumelhart |first1=David E. |author-link1=David E. Rumelhart |last2=Hinton |first2=Geoffrey E. |author-link2=Geoffrey E. Hinton |first3=Ronald J. |last3=Williams |author-link3=Ronald J. Williams |chapter=8. Learning Internal Representations by Error Propagation |location=Cambridge |publisher=MIT Press |year=1986b |isbn=0-262-18120-7 |chapter-url-access=registration |chapter-url=https://archive.org/details/paralleldistribu00rume }}</ref><ref>{{cite book|url={{google books |plainurl=y |id=4j9GAQAAIAAJ}}|title=Introduction to Machine Learning|last=Alpaydin|first=Ethem|publisher=MIT Press|year=2010|isbn=978-0-262-01243-0}}</ref> In 1985, the method was also described by David Parker.<ref>{{Cite report |last=Parker |first=D.B. |date=1985 |title=Learning Logic: Casting the Cortex of the Human Brain in Silicon |department=Center for Computational Research in Economics and Management Science |location=Cambridge MA |id=Technical Report TR-47 |publisher=Massachusetts Institute of Technology}}</ref><ref name=":0">{{Cite book |last=Hertz |first=John |title=Introduction to the theory of neural computation |date=1991 |publisher=Addison-Wesley |others=Krogh, Anders., Palmer, Richard G. |isbn=0-201-50395-6 |location=Redwood City, Calif. |pages=8 |oclc=21522159}}</ref> [[Yann LeCun]] proposed an alternative form of backpropagation for neural networks in his PhD thesis in 1987.<ref>{{Cite thesis |title=Modèles connexionnistes de l'apprentissage |url=https://www.sudoc.fr/043586643 |publisher=Université Pierre et Marie Curie |date=1987 |place=Paris, France |degree=Thèse de doctorat d'état |first=Yann |last=Le Cun}}</ref>

Gradient descent took a considerable amount of time to reach acceptance. Some early objections were that there was no guarantee gradient descent would reach a global minimum rather than merely a local one, and that neurons were "known" by physiologists to produce discrete signals (0/1), not continuous ones, for which there is no gradient to take.
See the interview with [[Geoffrey Hinton]],<ref name=":1" /> who was awarded the 2024 [[Nobel Prize in Physics]] for his contributions to the field.<ref>{{Cite web |title=The Nobel Prize in Physics 2024 |url=https://www.nobelprize.org/prizes/physics/2024/press-release/ |access-date=2024-10-13 |website=NobelPrize.org |language=en-US}}</ref>

=== Early successes ===
Contributing to the acceptance were several applications of backpropagation to training neural networks, some of which achieved popularity outside research circles.

In 1987, [[NETtalk (artificial neural network)|NETtalk]] learned to convert English text into pronunciation. Sejnowski tried training it with both backpropagation and a Boltzmann machine, but found backpropagation significantly faster, so he used it for the final NETtalk.<ref name=":1" />{{rp|p=324}} The NETtalk program became a popular success, appearing on the [[Today (American TV program)|''Today'' show]].<ref name=":02">{{Cite book |last=Sejnowski |first=Terrence J. |title=The deep learning revolution |date=2018 |publisher=The MIT Press |isbn=978-0-262-03803-4 |location=Cambridge, Massachusetts London, England}}</ref>

In 1989, Dean A. Pomerleau published ALVINN, a neural network trained to [[Vehicular automation|drive autonomously]] using backpropagation.<ref>{{Cite journal |last=Pomerleau |first=Dean A. |date=1988 |title=ALVINN: An Autonomous Land Vehicle in a Neural Network |url=https://proceedings.neurips.cc/paper/1988/hash/812b4ba287f5ee0bc9d43bbf5bbe87fb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=1}}</ref> [[LeNet]], published in 1989, learned to recognize handwritten zip codes. In 1992, [[TD-Gammon]] achieved top human-level play in backgammon. It was a reinforcement learning agent with a two-layer neural network trained by backpropagation.<ref>{{cite book |last1=Sutton |first1=Richard S. |last2=Barto |first2=Andrew G. |title=Reinforcement Learning: An Introduction |edition=2nd |publisher=MIT Press |place=Cambridge, MA |year=2018 |chapter=11.1 TD-Gammon |chapter-url=http://www.incompleteideas.net/book/11/node2.html}}</ref> In 1993, Eric Wan won an international pattern recognition contest using backpropagation.<ref name="schmidhuber2015">{{cite journal |last=Schmidhuber |first=Jürgen |author-link=Jürgen Schmidhuber |year=2015 |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref><ref>{{cite book |last=Wan |first=Eric A. |title=Time Series Prediction : Forecasting the Future and Understanding the Past |publisher=Addison-Wesley |year=1994 |isbn=0-201-62601-2 |editor-last=Weigend |editor-first=Andreas S. |editor-link=Andreas Weigend |series=Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis |volume=15 |location=Reading |pages=195–217 |chapter=Time Series Prediction by Using a Connectionist Network with Internal Delay Lines |editor2-last=Gershenfeld |editor2-first=Neil A. |editor2-link=Neil Gershenfeld |s2cid=12652643}}</ref>

=== After backpropagation ===
During the 2000s backpropagation fell out of favour,{{citation needed|date=February 2022}} but returned in the 2010s, benefiting from cheap, powerful [[GPU]]-based computing systems.
This has been especially so in [[speech recognition]], [[machine vision]], [[natural language processing]], and language structure learning research (in which it has been used to explain a variety of phenomena related to first<ref>{{Cite journal|last1=Chang|first1=Franklin|last2=Dell|first2=Gary S.|last3=Bock|first3=Kathryn|date=2006|title=Becoming syntactic.|journal=Psychological Review|volume=113|issue=2|pages=234–272|doi=10.1037/0033-295x.113.2.234|pmid=16637761}}</ref> and second language learning<ref>{{Cite journal|last1=Janciauskas|first1=Marius|last2=Chang|first2=Franklin|title=Input and Age-Dependent Variation in Second Language Learning: A Connectionist Account|journal=Cognitive Science|volume=42|pages=519–554|doi=10.1111/cogs.12519|pmid=28744901|pmc=6001481|year=2018|issue=Suppl Suppl 2 }}</ref>).<ref>{{Cite web |title=Decoding the Power of Backpropagation: A Deep Dive into Advanced Neural Network Techniques |url=https://www.janbasktraining.com/tutorials/backpropagation-in-deep-learning |website=janbasktraining.com |date=30 January 2024 |language=en}}</ref>

Error backpropagation has been suggested to explain human brain [[event-related potential]] (ERP) components like the [[N400 (neuroscience)|N400]] and [[P600 (neuroscience)|P600]].<ref>{{Cite journal|last1=Fitz|first1=Hartmut|last2=Chang|first2=Franklin|date=2019|title=Language ERPs reflect learning through prediction error propagation|journal=Cognitive Psychology|language=en|volume=111|pages=15–52|doi=10.1016/j.cogpsych.2019.03.002|pmid=30921626|hdl=21.11116/0000-0003-474D-8|s2cid=85501792|hdl-access=free}}</ref> In 2023, a backpropagation algorithm was implemented on a [[photonic processor]] by a team at [[Stanford University]].<ref>{{Cite web |title=Photonic Chips Curb AI Training's Energy Appetite - IEEE Spectrum |url=https://spectrum.ieee.org/backpropagation-optical-ai |access-date=2023-05-25 |website=[[IEEE]] |language=en}}</ref>