==Iterative method==
[[Image:stogra.png|thumb|right|Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]]
In stochastic (or "on-line") gradient descent, the true gradient of <math>Q(w)</math> is approximated by the gradient at a single sample:
<math display="block">w := w - \eta\, \nabla Q_i(w).</math>
As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an [[adaptive learning rate]] so that the algorithm converges.<ref>{{cite book |last1=Murphy |first1=Kevin |title=Probabilistic Machine Learning: An Introduction |url=https://probml.github.io/pml-book/book1.html |access-date=10 April 2021 |date=2021 |publisher=MIT Press}}</ref>

In pseudocode, stochastic gradient descent can be presented as follows:

<div style="margin-left: 35px; width: 600px">
{{framebox|blue}}
* Choose an initial vector of parameters <math>w</math> and learning rate <math>\eta</math>.
* Repeat until an approximate minimum is obtained:
** Randomly shuffle samples in the training set.
** For <math>i = 1, 2, \ldots, n</math>, do:
*** <math>w := w - \eta\, \nabla Q_i(w).</math>
{{frame-footer}}
</div>

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of [[Vectorization (mathematics)|vectorization]] libraries rather than computing each step separately, as was first shown in,<ref>{{cite conference |url=https://ieeexplore.ieee.org/document/604861 |title=Using PHiPAC to speed error back-propagation learning |last1=Bilmes |first1=Jeff |last2=Asanovic |first2=Krste |author2-link=Krste Asanović |last3=Chin |first3=Chee-Whye |last4=Demmel |first4=James |date=April 1997 |publisher=IEEE |book-title=1997 IEEE International Conference on Acoustics, Speech, and Signal Processing |pages=4153–4156 vol.5 |conference=ICASSP |location=Munich, Germany |doi=10.1109/ICASSP.1997.604861}}</ref> where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.
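For illustration, the following is a minimal sketch (not taken from the cited sources) of the shuffled per-sample update above and of its mini-batch variant; the least-squares objective <math>Q_i(w) = (x_i^\mathsf{T} w - y_i)^2</math> and all function and variable names are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

def sgd(X, y, eta=0.01, epochs=100):
    """Per-sample SGD on the assumed objective Q_i(w) = (x_i . w - y_i)**2."""
    n, d = X.shape
    w = np.zeros(d)                                # initial parameter vector
    for _ in range(epochs):                        # repeat until approximate minimum
        for i in np.random.permutation(n):         # shuffle each pass to prevent cycles
            grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of Q_i at w
            w -= eta * grad_i                      # w := w - eta * grad Q_i(w)
    return w

def minibatch_sgd(X, y, eta=0.01, epochs=100, batch=32):
    """Mini-batch variant: gradient averaged over `batch` samples per step."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch):
            j = idx[start:start + batch]
            # one vectorized matrix-vector product replaces len(j) per-sample steps
            grad = 2.0 * X[j].T @ (X[j] @ w - y[j]) / len(j)
            w -= eta * grad
    return w
</syntaxhighlight>

The mini-batch variant replaces the per-sample gradient with an average over a small batch, which is precisely what lets vectorized linear-algebra routines perform the work of many per-sample steps in a single operation.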
The convergence of stochastic gradient descent has been analyzed using the theories of [[convex optimization|convex minimization]] and of [[stochastic approximation]]. Briefly, when the [[learning rate]]s <math>\eta</math> decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges [[almost surely]] to a global minimum when the objective function is [[convex function|convex]] or [[pseudoconvex function|pseudoconvex]], and otherwise converges almost surely to a local minimum.<ref name="Bottou 1998"/><ref>{{cite journal |last=Kiwiel |first=Krzysztof C. |title=Convergence and efficiency of subgradient methods for quasiconvex minimization |journal=Mathematical Programming, Series A |publisher=Springer |location=Berlin, Heidelberg |issn=0025-5610 |pages=1–25 |volume=90 |issue=1 |doi=10.1007/PL00011414 |year=2001 |mr=1819784 |s2cid=10043417}}</ref> This is in fact a consequence of the [[Robbins–Siegmund theorem]].<ref>{{Cite book |last1=Robbins |first1=Herbert |author1-link=Herbert Robbins |last2=Siegmund |first2=David O. |author2-link=David O. Siegmund |contribution=A convergence theorem for non negative almost supermartingales and some applications |title=Optimizing Methods in Statistics |publisher=Academic Press |year=1971 |isbn=0-12-604550-X |editor-last=Rustagi |editor-first=Jagdish S.}}</ref>
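As an illustration of such a decreasing schedule (a sketch under the same assumed least-squares objective as above, not drawn from the cited analyses), the classical choice <math>\eta_t = \eta_0/(1+t)</math> satisfies <math>\textstyle\sum_t \eta_t = \infty</math> and <math>\textstyle\sum_t \eta_t^2 < \infty</math>, the step-size conditions of the Robbins–Monro type of stochastic approximation.

<syntaxhighlight lang="python">
import numpy as np

def sgd_decaying(X, y, eta0=0.1, epochs=100):
    """SGD with the decreasing step size eta_t = eta0 / (1 + t).

    This schedule diverges in sum but converges in sum of squares,
    the classical conditions under which the convergence results
    described above apply. Objective and names are assumptions.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            eta = eta0 / (1.0 + t)                    # decreasing learning rate
            w -= eta * 2.0 * (X[i] @ w - y[i]) * X[i] # per-sample gradient step
            t += 1
    return w
</syntaxhighlight>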