== Approximations in continuous time ==

For a small learning rate <math display="inline">\eta</math>, stochastic gradient descent <math display="inline">(w_n)_{n \in \N_0}</math> can be viewed as a discretization of the [[gradient flow]] ODE
<math display="block">\frac{d}{dt} W_t = -\nabla Q(W_t)</math>
subject to additional stochastic noise. This approximation is only valid on a finite time horizon, in the following sense: assume that all the coefficients <math display="inline">Q_i</math> are sufficiently smooth. Let <math display="inline">T > 0</math> and let <math display="inline">g: \R^d \to \R</math> be a sufficiently smooth test function. Then there exists a constant <math display="inline">C > 0</math> such that for all <math display="inline">\eta > 0</math>
<math display="block">\max_{k=0, \dots, \lfloor T/\eta \rfloor} \left|\mathbb E[g(w_k)] - g(W_{k \eta})\right| \le C \eta,</math>
where <math display="inline">\mathbb E</math> denotes the expectation with respect to the random choice of indices in the stochastic gradient descent scheme.

Since this approximation does not capture the random fluctuations around the mean behavior of stochastic gradient descent, solutions to [[stochastic differential equations]] (SDEs) have been proposed as limiting objects.<ref>{{Cite journal |last1=Li |first1=Qianxiao |last2=Tai |first2=Cheng |last3=E |first3=Weinan |date=2019 |title=Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations |url=http://jmlr.org/papers/v20/17-526.html |journal=Journal of Machine Learning Research |volume=20 |issue=40 |pages=1–47 |arxiv=1811.01558 |issn=1533-7928}}</ref> More precisely, the solution to the SDE
<math display="block">d W_t = -\nabla \left(Q(W_t) + \tfrac{1}{4} \eta \left|\nabla Q(W_t)\right|^2\right) dt + \sqrt{\eta}\, \Sigma(W_t)^{1/2} \, dB_t,</math>
with the covariance matrix
<math display="block">\Sigma(w) = \frac{1}{n} \sum_{i=1}^n \left(\nabla Q_i(w) - \nabla Q(w)\right) \left(\nabla Q_i(w) - \nabla Q(w)\right)^T,</math>
where <math display="inline">dB_t</math> denotes the [[Ito integral|Itô integral]] with respect to a [[Brownian motion]], is a more precise approximation in the sense that there exists a constant <math display="inline">C > 0</math> such that
<math display="block">\max_{k=0, \dots, \lfloor T/\eta \rfloor} \left|\mathbb E[g(w_k)] - \mathbb E[g(W_{k \eta})]\right| \le C \eta^2.</math>
However, this SDE only approximates the one-point motion of stochastic gradient descent. For an approximation of the [[Flow (mathematics)|stochastic flow]], one has to consider SDEs with infinite-dimensional noise.<ref>{{Cite arXiv |eprint=2302.07125 |class=math.PR |first1=Benjamin |last1=Gess |first2=Sebastian |last2=Kassing |first3=Vitalii |last3=Konarovskyi |title=Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent |date=14 February 2023}}</ref>
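The finite-horizon bound above can be illustrated numerically. The following sketch (not from the cited references; the least-squares objective, the test function <math display="inline">g(w) = |w|^2</math>, and all parameter values are assumptions chosen for illustration) averages many SGD runs to estimate <math display="inline">\mathbb E[g(w_k)]</math> and compares it against an explicit Euler discretization of the gradient flow ODE with the same step size, so that iterate <math display="inline">k</math> corresponds to time <math display="inline">t = k\eta</math>:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative toy problem (an assumption, not from the article):
# Q(w) = (1/n) sum_i Q_i(w) with Q_i(w) = 0.5 * (x_i . w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def grad_Qi(w, i):
    """Gradient of the single-sample loss Q_i."""
    return (X[i] @ w - y[i]) * X[i]

def grad_Q(w):
    """Gradient of the full objective Q = (1/n) sum_i Q_i."""
    return X.T @ (X @ w - y) / n

eta, T = 1e-3, 1.0          # small learning rate, finite horizon
steps = int(T / eta)
w0 = np.zeros(d)

# Estimate E[g(w_k)] for g(w) = |w|^2 by averaging independent SGD runs.
runs = 200
g_sgd = np.zeros(steps + 1)
for _ in range(runs):
    w = w0.copy()
    traj = [w @ w]
    for _ in range(steps):
        i = rng.integers(n)          # uniform random index, as in the scheme
        w = w - eta * grad_Qi(w, i)  # one stochastic gradient step
        traj.append(w @ w)
    g_sgd += np.array(traj)
g_sgd /= runs

# Explicit Euler on the gradient flow ODE dW/dt = -grad Q(W) with the same
# step size, so iterate k corresponds to time t = k * eta.
W = w0.copy()
g_flow = [W @ W]
for _ in range(steps):
    W = W - eta * grad_Q(W)
    g_flow.append(W @ W)

print("max_k |E[g(w_k)] - g(W_(k eta))| ~", np.abs(g_sgd - np.array(g_flow)).max())
</syntaxhighlight>

Rerunning the sketch with a smaller <math display="inline">\eta</math> (and correspondingly more steps) should shrink the printed discrepancy roughly in proportion to <math display="inline">\eta</math>, consistent with the stated <math display="inline">O(\eta)</math> weak-error bound on the fixed horizon <math display="inline">[0, T]</math>.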