==Iterative method==
[[Image:stogra.png|thumb|right|Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]]
In stochastic (or "on-line") gradient descent, the true gradient of <math>Q(w)</math> is approximated by the gradient at a single sample:
<math display="block">w := w - \eta\, \nabla Q_i(w).</math>
As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an [[adaptive learning rate]] so that the algorithm converges.<ref>{{cite book |last1=Murphy |first1=Kevin |title=Probabilistic Machine Learning: An Introduction |url=https://probml.github.io/pml-book/book1.html |access-date=10 April 2021 |date=2021 |publisher=MIT Press}}</ref>

In pseudocode, stochastic gradient descent can be presented as follows:

<div style="margin-left: 35px; width: 600px">
{{framebox|blue}}
* Choose an initial vector of parameters <math>w</math> and learning rate <math>\eta</math>.
* Repeat until an approximate minimum is obtained:
** Randomly shuffle samples in the training set.
** For <math>i = 1, 2, \ldots, n</math>, do:
*** <math>w := w - \eta\, \nabla Q_i(w).</math>
{{frame-footer}}
</div>

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of [[Vectorization (mathematics)|vectorization]] libraries rather than computing each step separately, as was first shown in,<ref>{{cite conference |url=https://ieeexplore.ieee.org/document/604861 |title=Using PHiPAC to speed error back-propagation learning |last1=Bilmes |first1=Jeff |last2=Asanovic |first2=Krste |author2-link=Krste Asanović |last3=Chin |first3=Chee-Whye |last4=Demmel |first4=James |date=April 1997 |publisher=IEEE |book-title=1997 IEEE International Conference on Acoustics, Speech, and Signal Processing |pages=4153–4156 vol.5 |conference=ICASSP |location=Munich, Germany |doi=10.1109/ICASSP.1997.604861}}</ref> where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.
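For illustration, the following is a minimal sketch (not taken from the cited sources) of the shuffled per-sample update above and of its mini-batch variant; the least-squares objective <math>Q_i(w) = (x_i^\mathsf{T} w - y_i)^2</math> and all function and variable names are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

def sgd(X, y, eta=0.01, epochs=100):
    """Per-sample SGD on the assumed objective Q_i(w) = (x_i . w - y_i)**2."""
    n, d = X.shape
    w = np.zeros(d)                                # initial parameter vector
    for _ in range(epochs):                        # repeat until approximate minimum
        for i in np.random.permutation(n):         # shuffle each pass to prevent cycles
            grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of Q_i at w
            w -= eta * grad_i                      # w := w - eta * grad Q_i(w)
    return w

def minibatch_sgd(X, y, eta=0.01, epochs=100, batch=32):
    """Mini-batch variant: gradient averaged over `batch` samples per step."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch):
            j = idx[start:start + batch]
            # one vectorized matrix-vector product replaces len(j) per-sample steps
            grad = 2.0 * X[j].T @ (X[j] @ w - y[j]) / len(j)
            w -= eta * grad
    return w
</syntaxhighlight>

The mini-batch variant replaces the per-sample gradient with an average over a small batch, which is precisely what lets vectorized linear-algebra routines perform the work of many per-sample steps in a single operation.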
The convergence of stochastic gradient descent has been analyzed using the theories of [[convex optimization|convex minimization]] and of [[stochastic approximation]]. Briefly, when the [[learning rate]]s <math>\eta</math> decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges [[almost surely]] to a global minimum when the objective function is [[convex function|convex]] or [[pseudoconvex function|pseudoconvex]], and otherwise converges almost surely to a local minimum.<ref name="Bottou 1998"/><ref>{{cite journal |last=Kiwiel |first=Krzysztof C. |title=Convergence and efficiency of subgradient methods for quasiconvex minimization |journal=Mathematical Programming, Series A |publisher=Springer |location=Berlin, Heidelberg |issn=0025-5610 |pages=1–25 |volume=90 |issue=1 |doi=10.1007/PL00011414 |year=2001 |mr=1819784 |s2cid=10043417}}</ref> This is in fact a consequence of the [[Robbins–Siegmund theorem]].<ref>{{Cite book |last1=Robbins |first1=Herbert |author1-link=Herbert Robbins |last2=Siegmund |first2=David O. |author2-link=David O. Siegmund |contribution=A convergence theorem for non negative almost supermartingales and some applications |title=Optimizing Methods in Statistics |publisher=Academic Press |year=1971 |isbn=0-12-604550-X |editor-last=Rustagi |editor-first=Jagdish S.}}</ref>
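As an illustration of such a decreasing schedule (a sketch under the same assumed least-squares objective as above, not drawn from the cited analyses), the classical choice <math>\eta_t = \eta_0/(1+t)</math> satisfies <math>\textstyle\sum_t \eta_t = \infty</math> and <math>\textstyle\sum_t \eta_t^2 < \infty</math>, the step-size conditions of the Robbins–Monro type of stochastic approximation.

<syntaxhighlight lang="python">
import numpy as np

def sgd_decaying(X, y, eta0=0.1, epochs=100):
    """SGD with the decreasing step size eta_t = eta0 / (1 + t).

    This schedule diverges in sum but converges in sum of squares,
    the classical conditions under which the convergence results
    described above apply. Objective and names are assumptions.
    """
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            eta = eta0 / (1.0 + t)                    # decreasing learning rate
            w -= eta * 2.0 * (X[i] @ w - y[i]) * X[i] # per-sample gradient step
            t += 1
    return w
</syntaxhighlight>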