==Linear regression==
Suppose we want to fit a straight line <math>\hat y = w_1 + w_2 x</math> to a training set with observations <math>((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))</math> and corresponding estimated responses <math>(\hat y_1, \hat y_2, \ldots, \hat y_n)</math> using [[least squares]]. The objective function to be minimized is
<math display="block">Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(\hat y_i - y_i\right)^2 = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.</math>

For this specific problem, the last line in the above pseudocode becomes:
<math display="block">\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.</math>

Note that in each iteration (update step), the gradient is evaluated only at a single point <math>x_i</math> rather than over the full training set. This is the key difference between stochastic gradient descent and batch gradient descent.

In general, given a linear regression problem <math>\hat y = \sum_{k\in 1:m} w_k x_k</math>, stochastic gradient descent behaves differently when <math>m < n</math> (underparameterized) and <math>m \geq n</math> (overparameterized). In the overparameterized case, stochastic gradient descent converges to <math>\arg\min_{w:\, w^T x_k = y_k \,\forall k \in 1:n} \|w - w_0\|</math>; that is, SGD converges to the interpolating solution with minimum distance from the starting point <math>w_0</math>. This holds even when the learning rate remains constant. In the underparameterized case, SGD does not converge if the learning rate remains constant.<ref>{{Cite journal |last=Belkin |first=Mikhail |date=May 2021 |title=Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation |url=https://www.cambridge.org/core/journals/acta-numerica/article/abs/fit-without-fear-remarkable-mathematical-phenomena-of-deep-learning-through-the-prism-of-interpolation/DBAC769EB7F4DBA5C4720932C2826014 |journal=Acta Numerica |language=en |volume=30 |pages=203–248 |doi=10.1017/S0962492921000039 |arxiv=2105.14368 |issn=0962-4929}}</ref>
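The per-sample update above can be illustrated with a short program. The following Python sketch is a minimal illustration only: the synthetic data, the initial parameters, the number of passes, and the constant learning rate <math>\eta = 0.05</math> are assumed values chosen for demonstration, not part of the algorithm itself.

<syntaxhighlight lang="python">
import random

# Illustrative training data (assumed): noisy samples near the line y = 1 + 2x.
data = [(x, 1.0 + 2.0 * x + random.gauss(0.0, 0.1))
        for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w1, w2 = 0.0, 0.0   # starting point w_0
eta = 0.05          # constant learning rate (assumed value)

for epoch in range(200):
    random.shuffle(data)                 # visit samples in random order each pass
    for x_i, y_i in data:
        error = (w1 + w2 * x_i) - y_i    # residual for a single sample
        # Per-sample update: gradient of (w1 + w2*x_i - y_i)^2 with respect to (w1, w2)
        w1 -= eta * 2 * error
        w2 -= eta * 2 * error * x_i

print(w1, w2)  # approaches the least-squares coefficients (about 1 and 2 here)
</syntaxhighlight>

Each inner-loop step uses the gradient of a single summand <math>Q_i(w)</math>, mirroring the update rule above; shuffling the samples between passes is the common way of drawing them in random order.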