=== Recurrent neural networks ===
One origin of RNNs was [[statistical mechanics]]. In 1972, [[Shun'ichi Amari]] proposed modifying the weights of an [[Ising model]] using the [[Hebbian theory|Hebbian learning]] rule as a model of [[Hopfield network|associative memory]], thereby adding a learning component.<ref>{{Cite journal |last=Amari |first=S.-I. |date=November 1972 |title=Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements |url=https://ieeexplore.ieee.org/document/1672070 |journal=IEEE Transactions on Computers |volume=C-21 |issue=11 |pages=1197–1206 |doi=10.1109/T-C.1972.223477 |issn=0018-9340 |archive-date=12 October 2024 |access-date=7 August 2024 |archive-url=https://archive.today/20241012222852/https://ieeexplore.ieee.org/document/1672070 |url-status=live }}</ref> This was popularized as the Hopfield network by [[John Hopfield]] (1982).<ref name="Hopfield19822">{{cite journal |last1=Hopfield |first1=J. J. |date=1982 |title=Neural networks and physical systems with emergent collective computational abilities |journal=Proceedings of the National Academy of Sciences |volume=79 |issue=8 |pages=2554–2558 |bibcode=1982PNAS...79.2554H |doi=10.1073/pnas.79.8.2554 |pmc=346238 |pmid=6953413 |doi-access=free}}</ref> Another origin of RNNs was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy.
In 1901, [[Santiago Ramón y Cajal|Cajal]] observed "recurrent semicircles" in the [[Cerebellum|cerebellar cortex]].<ref>{{Cite journal |last1=Espinosa-Sanchez |first1=Juan Manuel |last2=Gomez-Marin |first2=Alex |last3=de Castro |first3=Fernando |date=5 July 2023 |title=The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics |url=http://journals.sagepub.com/doi/10.1177/10738584231179932 |journal=The Neuroscientist |volume=31 |issue=1 |pages=14–30 |language=en |doi=10.1177/10738584231179932 |issn=1073-8584 |pmid=37403768 |hdl=10261/348372 |hdl-access=free |archive-date=12 October 2024 |access-date=7 August 2024 |archive-url=https://archive.today/20241012221924/http://journals.sagepub.com/doi/10.1177/10738584231179932 |url-status=live }}</ref> [[Donald O. Hebb|Hebb]] considered the "reverberating circuit" as an explanation for short-term memory.<ref>{{Cite web |title=reverberating circuit |url=https://www.oxfordreference.com/display/10.1093/oi/authority.20110803100417461 |access-date=27 July 2024 |website=Oxford Reference |archive-date=12 October 2024 |archive-url=https://archive.today/20241012222600/https://www.oxfordreference.com/display/10.1093/oi/authority.20110803100417461 |url-status=live }}</ref> The McCulloch and Pitts paper (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.<ref name=WM>{{Cite journal |last1=McCulloch |first1=Warren S. |last2=Pitts |first2=Walter |date=December 1943 |title=A logical calculus of the ideas immanent in nervous activity |url=http://link.springer.com/10.1007/BF02478259 |journal=The Bulletin of Mathematical Biophysics |volume=5 |issue=4 |pages=115–133 |doi=10.1007/BF02478259 |issn=0007-4985 |archive-date=12 October 2024 |access-date=7 August 2024 |archive-url=https://archive.today/20241012221923/http://link.springer.com/10.1007/BF02478259 |url-status=live }}</ref>

In 1982, a recurrent neural network with an array architecture (rather than a multilayer perceptron architecture), namely a Crossbar Adaptive Array,<ref name="CAA1982"> Bozinovski, S. (1982). "A self-learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North-Holland. pp. 397–402. ISBN 978-0-444-86488-8</ref><ref name="caa1995">Bozinovski S. (1995) "Neuro genetic agents and structural theory of self-reinforcement learning systems". CMPSCI Technical Report 95-107, University of Massachusetts at Amherst [https://web.cs.umass.edu/publication/docs/1995/UM-CS-1995-107.pdf] {{Webarchive|url=https://web.archive.org/web/20241008120651/https://web.cs.umass.edu/publication/docs/1995/UM-CS-1995-107.pdf |date=8 October 2024 }}</ref> used direct recurrent connections from the output to the supervisor (teaching) inputs. In addition to computing actions (decisions), it computed internal state evaluations (emotions) of the consequence situations. By eliminating the external supervisor, it introduced the self-learning method in neural networks.

In cognitive psychology, the journal ''American Psychologist'' carried a debate in the early 1980s on the relation between cognition and emotion. Zajonc in 1980 stated that emotion is computed first and is independent from cognition, while Lazarus in 1982 stated that cognition is computed first and is inseparable from emotion.<ref>R. Zajonc (1980) "Feeling and thinking: Preferences need no inferences". American Psychologist 35 (2): 151-175</ref><ref>Lazarus R. (1982) "Thoughts on the relations between emotion and cognition" American Psychologist 37 (9): 1019-1024</ref> In 1982 the Crossbar Adaptive Array gave a neural network model of the cognition-emotion relation.<ref name = "CAA1982" /><ref>Bozinovski, S. (2014) "Modeling mechanisms of cognition-emotion interaction in artificial neural networks, since 1981" Procedia Computer Science p. 255-263 (https://core.ac.uk/download/pdf/81973924.pdf {{Webarchive|url=https://web.archive.org/web/20190323204838/https://core.ac.uk/download/pdf/81973924.pdf |date=23 March 2019 }})</ref> It was an example of a debate in which an AI system, a recurrent neural network, contributed to an issue that cognitive psychology was addressing at the same time.

Two early influential works were the [[Recurrent neural network#Jordan network|Jordan network]] (1986) and the [[Recurrent neural network#Elman network|Elman network]] (1990), which applied RNNs to study [[cognitive psychology]].

In the 1980s, backpropagation did not work well for deep RNNs. To overcome this problem, in 1991, [[Jürgen Schmidhuber]] proposed the "neural sequence chunker" or "neural history compressor"<ref name="chunker1991">{{cite journal |last1=Schmidhuber |first1=Jürgen |date=April 1991 |title=Neural Sequence Chunkers |author-link=Jürgen Schmidhuber |url=https://people.idsia.ch/~juergen/FKI-148-91ocr.pdf |journal=TR FKI-148, TU Munich |archive-date=14 September 2024 |access-date=21 September 2024 |archive-url=https://web.archive.org/web/20240914162750/https://people.idsia.ch/~juergen/FKI-148-91ocr.pdf |url-status=live }}</ref><ref name="schmidhuber1992">{{cite journal |last1=Schmidhuber |first1=Jürgen |year=1992 |title=Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991) |url=https://sferics.idsia.ch/pub/juergen/chunker.pdf |journal=Neural Computation |volume=4 |issue=2 |pages=234–242 |doi=10.1162/neco.1992.4.2.234 |s2cid=18271205 |archive-date=14 September 2024 |access-date=21 September 2024 |archive-url=https://web.archive.org/web/20240914162750/https://sferics.idsia.ch/pub/juergen/chunker.pdf |url-status=live }}</ref> which introduced the important concepts of self-supervised pre-training (the "P" in [[ChatGPT]]) and neural [[knowledge distillation]].<ref name=DLhistory/> In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent [[Layer (deep learning)|layers]] in an RNN unfolded in time.<ref name="schmidhuber19932">{{Cite book |last=Schmidhuber |first=Jürgen |url=https://sferics.idsia.ch/pub/juergen/habilitation.pdf |title=Habilitation thesis: System modeling and optimization |year=1993 |archive-date=7 August 2024 |access-date=21 September 2024 |archive-url=https://web.archive.org/web/20240807084323/https://sferics.idsia.ch/pub/juergen/habilitation.pdf |url-status=live }} Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.</ref>

In 1991, [[Sepp Hochreiter]]'s diploma thesis<ref name="HOCH1991">S. Hochreiter., "[http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen]", {{Webarchive|url=https://web.archive.org/web/20150306075401/http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf|date=6 March 2015}}, ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991.</ref> identified and analyzed the [[vanishing gradient problem]]<ref name="HOCH1991" /><ref name="HOCH2001">{{cite book |last=Hochreiter |first=S. |title=A Field Guide to Dynamical Recurrent Networks |date=15 January 2001 |publisher=John Wiley & Sons |isbn=978-0-7803-5369-5 |editor-last1=Kolen |editor-first1=John F. |chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies |display-authors=etal |editor-last2=Kremer |editor-first2=Stefan C. |chapter-url=https://books.google.com/books?id=NWOcMVA64aAC |access-date=26 June 2017 |archive-date=19 May 2024 |archive-url=https://web.archive.org/web/20240519081124/https://books.google.com/books?id=NWOcMVA64aAC |url-status=live }}</ref> and proposed recurrent [[Residual neural network|residual]] connections to solve it. 
He and Schmidhuber introduced [[long short-term memory]] (LSTM), which set accuracy records in multiple application domains.<ref>{{Cite Q|Q98967430}}</ref><ref name="lstm2">{{Cite journal |last1=Hochreiter |first1=Sepp |author-link=Sepp Hochreiter |last2=Schmidhuber |first2=Jürgen |date=1 November 1997 |title=Long Short-Term Memory |journal=Neural Computation |volume=9 |issue=8 |pages=1735–1780 |doi=10.1162/neco.1997.9.8.1735 |pmid=9377276 |s2cid=1915014}}</ref> This was not yet the modern version of LSTM, which requires the forget gate, introduced in 1999.<ref name="lstm1999">{{Cite book |last1=Gers |first1=Felix |title=9th International Conference on Artificial Neural Networks: ICANN '99 |last2=Schmidhuber |first2=Jürgen |last3=Cummins |first3=Fred |year=1999 |isbn=0-85296-721-7 |volume=1999 |pages=850–855 |chapter=Learning to forget: Continual prediction with LSTM |doi=10.1049/cp:19991218}}</ref> LSTM became the default choice of RNN architecture.

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by [[Terry Sejnowski]], [[Peter Dayan]], [[Geoffrey Hinton]], and others, including the [[Boltzmann machine]],<ref>{{Cite journal |last1=Ackley |first1=David H. |last2=Hinton |first2=Geoffrey E. |last3=Sejnowski |first3=Terrence J. |date=1 January 1985 |title=A learning algorithm for boltzmann machines |url=https://www.sciencedirect.com/science/article/pii/S0364021385800124 |journal=Cognitive Science |volume=9 |issue=1 |pages=147–169 |doi=10.1016/S0364-0213(85)80012-4 |issn=0364-0213 |archive-date=17 September 2024 |access-date=7 August 2024 |archive-url=https://web.archive.org/web/20240917124802/https://www.sciencedirect.com/science/article/pii/S0364021385800124 |url-status=live }}</ref> [[restricted Boltzmann machine]],<ref>{{cite book |last=Smolensky |first=Paul |title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations |title-link=Connectionism |publisher=MIT Press |year=1986 |isbn=0-262-68053-X |editor1-last=Rumelhart |editor1-first=David E. |pages=[https://archive.org/details/paralleldistribu00rume/page/194 194–281] |chapter=Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory |editor2-last=McClelland |editor2-first=James L. |chapter-url=https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf |archive-date=14 July 2023 |access-date=7 August 2024 |archive-url=https://web.archive.org/web/20230714174222/https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf |url-status=live }}</ref> [[Helmholtz machine]],<ref name="“nc95“">{{Cite journal |last1=Dayan |first1=Peter |author-link1=Peter Dayan |last2=Hinton |first2=Geoffrey E. |author-link2=Geoffrey Hinton |last3=Neal |first3=Radford M. |author-link3=Radford M. Neal |last4=Zemel |first4=Richard S. |author-link4=Richard Zemel |date=1995 |title=The Helmholtz machine |journal=Neural Computation |volume=7 |issue=5 |pages=889–904 |doi=10.1162/neco.1995.7.5.889 |pmid=7584891 |s2cid=1890561 |hdl-access=free |hdl=21.11116/0000-0002-D6D3-E}} {{closed access}}</ref> and the [[wake-sleep algorithm]].<ref name=":13">{{Cite journal |last1=Hinton |first1=Geoffrey E. |author-link=Geoffrey Hinton |last2=Dayan |first2=Peter |author-link2=Peter Dayan |last3=Frey |first3=Brendan J. |author-link3=Brendan Frey |last4=Neal |first4=Radford |date=26 May 1995 |title=The wake-sleep algorithm for unsupervised neural networks |journal=Science |volume=268 |issue=5214 |pages=1158–1161 |bibcode=1995Sci...268.1158H |doi=10.1126/science.7761831 |pmid=7761831 |s2cid=871473}}</ref> These were designed for unsupervised learning of deep generative models.