==History==
In 1951, [[Herbert Robbins]] and [[John U. Monro#Personal life|Sutton Monro]] introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.<ref name="rm">{{Cite journal |last1=Robbins |first1=H. |author-link=Herbert Robbins |last2=Monro |first2=S. |year=1951 |title=A Stochastic Approximation Method |journal=The Annals of Mathematical Statistics |volume=22 |issue=3 |pages=400 |doi=10.1214/aoms/1177729586 |doi-access=free}}</ref> Building on this work one year later, [[Jack Kiefer (statistician)|Jack Kiefer]] and [[Jacob Wolfowitz]] published [[Stochastic approximation#Kiefer–Wolfowitz algorithm|an optimization algorithm]] very close to stochastic gradient descent, using [[Finite difference#Basic types|central differences]] as an approximation of the gradient.<ref>{{Cite journal |title=Stochastic Estimation of the Maximum of a Regression Function |date=1952 |doi=10.1214/aoms/1177729392 |last1=Kiefer |first1=J. |last2=Wolfowitz |first2=J. |journal=The Annals of Mathematical Statistics |volume=23 |issue=3 |pages=462–466 |doi-access=free }}</ref> Later in the 1950s, [[Frank Rosenblatt]] used SGD to optimize his [[Perceptron|perceptron model]], demonstrating the first application of stochastic gradient descent to neural networks.<ref>{{Cite journal |title=The perceptron: A probabilistic model for information storage and organization in the brain. |date=1958 |doi=10.1037/h0042519 |last1=Rosenblatt |first1=F. |journal=Psychological Review |volume=65 |issue=6 |pages=386–408 |pmid=13602029 |s2cid=12781225 }}</ref>

[[Backpropagation]] was first described in 1986, with stochastic gradient descent used to efficiently optimize the parameters of neural networks with multiple [[Artificial neural network|hidden layers]]. Soon after, another improvement was developed: mini-batch gradient descent, in which small batches of data are substituted for single samples. In 1997, the practical performance benefits of the vectorization achievable with such small batches were first explored,<ref>{{cite conference |last1=Bilmes |first1=Jeff |last2=Asanovic |first2=Krste |author2-link=Krste Asanović |last3=Chin |first3=Chee-Whye |last4=Demmel |first4=James |date=April 1997 |title=Using PHiPAC to speed error back-propagation learning |url=https://ieeexplore.ieee.org/document/604861 |conference=ICASSP |location=Munich, Germany |publisher=IEEE |pages=4153–4156 vol.5 |doi=10.1109/ICASSP.1997.604861 |book-title=1997 IEEE International Conference on Acoustics, Speech, and Signal Processing}}</ref> paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent with those of full-batch [[gradient descent]].<ref>{{Cite journal |title=Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling |url=https://ieeexplore.ieee.org/document/8945166 |access-date=2023-10-02 |journal=IEEE Transactions on Neural Networks and Learning Systems |date=2020 |doi=10.1109/TNNLS.2019.2957003 |language=en-US |last1=Peng |first1=Xinyu |last2=Li |first2=Li |last3=Wang |first3=Fei-Yue |volume=31 |issue=11 |pages=4649–4659 |pmid=31899442 |arxiv=1903.04192 |s2cid=73728964 }}</ref>

By the 1980s, [[Momentum (machine learning)|momentum]] had already been introduced, and it was added to SGD optimization techniques in 1986.<ref>{{Cite journal |last1=Rumelhart |first1=David E. |last2=Hinton |first2=Geoffrey E. |last3=Williams |first3=Ronald J. |date=October 1986 |title=Learning representations by back-propagating errors |url=https://www.nature.com/articles/323533a0 |journal=Nature |language=en |volume=323 |issue=6088 |pages=533–536 |doi=10.1038/323533a0 |bibcode=1986Natur.323..533R |s2cid=205001834 |issn=1476-4687}}</ref> However, these optimization techniques assumed constant [[Hyperparameter (machine learning)|hyperparameters]], i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches applying SGD with a per-parameter learning rate were introduced, with AdaGrad (for "Adaptive Gradient") in 2011<ref name="duchi2">{{cite journal |last1=Duchi |first1=John |last2=Hazan |first2=Elad |last3=Singer |first3=Yoram |year=2011 |title=Adaptive subgradient methods for online learning and stochastic optimization |url=http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf |journal=[[Journal of Machine Learning Research|JMLR]] |volume=12 |pages=2121–2159}}</ref> and RMSprop (for "Root Mean Square Propagation") in 2012.<ref name="rmsprop2">{{Cite web |last=Hinton |first=Geoffrey |author-link=Geoffrey Hinton |title=Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude |url=http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf |access-date=19 March 2020 |pages=26}}</ref> In 2014, Adam (for "Adaptive Moment Estimation") was published, combining the adaptive per-parameter learning rates of RMSprop with momentum; many variants of Adam were subsequently developed, such as AdamW and Adamax, and they are now offered in standard libraries alongside earlier adaptive optimizers such as Adagrad and Adadelta.<ref name="Adam20142">{{cite arXiv |eprint=1412.6980 |class=cs.LG |first1=Diederik |last1=Kingma |first2=Jimmy |last2=Ba |title=Adam: A Method for Stochastic Optimization |year=2014}}</ref><ref name="pytorch.org">{{Cite web |title=torch.optim — PyTorch 2.0 documentation |url=https://pytorch.org/docs/stable/optim.html |access-date=2023-10-02 |website=pytorch.org}}</ref> As of 2023, optimization in machine learning is dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,<ref>{{Cite journal |last1=Nguyen |first1=Giang |last2=Dlugolinsky |first2=Stefan |last3=Bobák |first3=Martin |last4=Tran |first4=Viet |last5=García |first5=Álvaro |last6=Heredia |first6=Ignacio |last7=Malík |first7=Peter |last8=Hluchý |first8=Ladislav |date=19 January 2019 |title=Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey |url=https://link.springer.com/content/pdf/10.1007/s10462-018-09679-z.pdf |journal=Artificial Intelligence Review|volume=52 |pages=77–124 |doi=10.1007/s10462-018-09679-z |s2cid=254236976 }}</ref> as of 2023 offer little beyond Adam-derived optimizers and their predecessors, such as RMSprop and classic SGD. PyTorch also partially supports [[Limited-memory BFGS]], a quasi-Newton method, but only when all parameters reside on a single device and form a single parameter group.<ref name="pytorch.org"/><ref>{{Cite web |title=Module: tf.keras.optimizers {{!}} TensorFlow v2.14.0 |url=https://www.tensorflow.org/api_docs/python/tf/keras/optimizers |access-date=2023-10-02 |website=TensorFlow |language=en}}</ref>
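The following is a minimal illustrative sketch, not drawn from the sources cited above, of how the techniques described in this section appear in practice: a mini-batch training loop using PyTorch's <code>torch.optim</code>, with classic SGD plus momentum that can be swapped for an Adam-derived optimizer such as AdamW. The toy linear model and synthetic data are assumptions made purely for illustration.

<syntaxhighlight lang="python">
# Illustrative sketch only: the toy model and synthetic data are assumed.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data; batch_size sets the mini-batch size.
X, y = torch.randn(512, 10), torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# Classic SGD with a fixed learning rate and momentum ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# ... or an Adam-derived optimizer with per-parameter adaptive learning rates:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for xb, yb in loader:              # one parameter update per mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()                # backpropagation computes the gradients
    optimizer.step()               # the optimizer applies its update rule
</syntaxhighlight>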