==History==
In 1951, [[Herbert Robbins]] and [[John U. Monro#Personal life|Sutton Monro]] introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.<ref name="rm">{{Cite journal |last1=Robbins |first1=H. |author-link=Herbert Robbins |last2=Monro |first2=S. |year=1951 |title=A Stochastic Approximation Method |journal=The Annals of Mathematical Statistics |volume=22 |issue=3 |pages=400 |doi=10.1214/aoms/1177729586 |doi-access=free}}</ref> Building on this work one year later, [[Jack Kiefer (statistician)|Jack Kiefer]] and [[Jacob Wolfowitz]] published [[Stochastic approximation#Kiefer–Wolfowitz algorithm|an optimization algorithm]] very close to stochastic gradient descent, using [[Finite difference#Basic types|central differences]] as an approximation of the gradient.<ref>{{Cite journal |title=Stochastic Estimation of the Maximum of a Regression Function |date=1952 |doi=10.1214/aoms/1177729392 |last1=Kiefer |first1=J. |last2=Wolfowitz |first2=J. |journal=The Annals of Mathematical Statistics |volume=23 |issue=3 |pages=462–466 |doi-access=free }}</ref> Later in the 1950s, [[Frank Rosenblatt]] used SGD to optimize his [[Perceptron|perceptron model]], demonstrating the first application of stochastic gradient descent to neural networks.<ref>{{Cite journal |title=The perceptron: A probabilistic model for information storage and organization in the brain. |date=1958 |doi=10.1037/h0042519 |last1=Rosenblatt |first1=F. |journal=Psychological Review |volume=65 |issue=6 |pages=386–408 |pmid=13602029 |s2cid=12781225 }}</ref>

[[Backpropagation]] was first described in 1986, with stochastic gradient descent used to efficiently optimize the parameters of neural networks with multiple [[Artificial neural network|hidden layers]]. Soon after, another improvement was developed: mini-batch gradient descent, in which small batches of data are substituted for single samples. In 1997, the practical performance benefits of the vectorization achievable with such small batches were first explored,<ref>{{cite conference |last1=Bilmes |first1=Jeff |last2=Asanovic |first2=Krste |author2-link=Krste Asanović |last3=Chin |first3=Chee-Whye |last4=Demmel |first4=James |date=April 1997 |title=Using PHiPAC to speed error back-propagation learning |url=https://ieeexplore.ieee.org/document/604861 |conference=ICASSP |location=Munich, Germany |publisher=IEEE |pages=4153–4156 vol.5 |doi=10.1109/ICASSP.1997.604861 |book-title=1997 IEEE International Conference on Acoustics, Speech, and Signal Processing}}</ref> paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent with those of full-batch [[gradient descent]].<ref>{{Cite journal |title=Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling |url=https://ieeexplore.ieee.org/document/8945166 |access-date=2023-10-02 |journal=IEEE Transactions on Neural Networks and Learning Systems |date=2020 |doi=10.1109/TNNLS.2019.2957003 |language=en-US |last1=Peng |first1=Xinyu |last2=Li |first2=Li |last3=Wang |first3=Fei-Yue |volume=31 |issue=11 |pages=4649–4659 |pmid=31899442 |arxiv=1903.04192 |s2cid=73728964 }}</ref>

By the 1980s, [[Momentum (machine learning)|momentum]] had already been introduced, and it was added to SGD optimization techniques in 1986.<ref>{{Cite journal |last1=Rumelhart |first1=David E. |last2=Hinton |first2=Geoffrey E. |last3=Williams |first3=Ronald J. |date=October 1986 |title=Learning representations by back-propagating errors |url=https://www.nature.com/articles/323533a0 |journal=Nature |language=en |volume=323 |issue=6088 |pages=533–536 |doi=10.1038/323533a0 |bibcode=1986Natur.323..533R |s2cid=205001834 |issn=1476-4687}}</ref> However, these optimization techniques assumed constant [[Hyperparameter (machine learning)|hyperparameters]], i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches applying SGD with a per-parameter learning rate were introduced, with AdaGrad (for "Adaptive Gradient") in 2011<ref name="duchi2">{{cite journal |last1=Duchi |first1=John |last2=Hazan |first2=Elad |last3=Singer |first3=Yoram |year=2011 |title=Adaptive subgradient methods for online learning and stochastic optimization |url=http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf |journal=[[Journal of Machine Learning Research|JMLR]] |volume=12 |pages=2121–2159}}</ref> and RMSprop (for "Root Mean Square Propagation") in 2012.<ref name="rmsprop2">{{Cite web |last=Hinton |first=Geoffrey |author-link=Geoffrey Hinton |title=Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude |url=http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf |access-date=19 March 2020 |pages=26}}</ref> In 2014, Adam (for "Adaptive Moment Estimation") was published, combining the adaptive per-parameter learning rates of RMSprop with momentum; many variants of Adam were subsequently developed, such as AdamW and Adamax, and they are now offered in standard libraries alongside earlier adaptive optimizers such as Adagrad and Adadelta.<ref name="Adam20142">{{cite arXiv |eprint=1412.6980 |class=cs.LG |first1=Diederik |last1=Kingma |first2=Jimmy |last2=Ba |title=Adam: A Method for Stochastic Optimization |year=2014}}</ref><ref name="pytorch.org">{{Cite web |title=torch.optim — PyTorch 2.0 documentation |url=https://pytorch.org/docs/stable/optim.html |access-date=2023-10-02 |website=pytorch.org}}</ref> As of 2023, optimization in machine learning is dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,<ref>{{Cite journal |last1=Nguyen |first1=Giang |last2=Dlugolinsky |first2=Stefan |last3=Bobák |first3=Martin |last4=Tran |first4=Viet |last5=García |first5=Álvaro |last6=Heredia |first6=Ignacio |last7=Malík |first7=Peter |last8=Hluchý |first8=Ladislav |date=19 January 2019 |title=Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey |url=https://link.springer.com/content/pdf/10.1007/s10462-018-09679-z.pdf |journal=Artificial Intelligence Review|volume=52 |pages=77–124 |doi=10.1007/s10462-018-09679-z |s2cid=254236976 }}</ref> as of 2023 offer little beyond Adam-derived optimizers and their predecessors, such as RMSprop and classic SGD. PyTorch also partially supports [[Limited-memory BFGS]], a quasi-Newton method, but only when all parameters reside on a single device and form a single parameter group.<ref name="pytorch.org"/><ref>{{Cite web |title=Module: tf.keras.optimizers {{!}} TensorFlow v2.14.0 |url=https://www.tensorflow.org/api_docs/python/tf/keras/optimizers |access-date=2023-10-02 |website=TensorFlow |language=en}}</ref>
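The following is a minimal illustrative sketch, not drawn from the sources cited above, of how the techniques described in this section appear in practice: a mini-batch training loop using PyTorch's <code>torch.optim</code>, with classic SGD plus momentum that can be swapped for an Adam-derived optimizer such as AdamW. The toy linear model and synthetic data are assumptions made purely for illustration.

<syntaxhighlight lang="python">
# Illustrative sketch only: the toy model and synthetic data are assumed.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data; batch_size sets the mini-batch size.
X, y = torch.randn(512, 10), torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# Classic SGD with a fixed learning rate and momentum ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# ... or an Adam-derived optimizer with per-parameter adaptive learning rates:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for xb, yb in loader:              # one parameter update per mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()                # backpropagation computes the gradients
    optimizer.step()               # the optimizer applies its update rule
</syntaxhighlight>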