==Linear regression==
Suppose we want to fit a straight line <math>\hat y = w_1 + w_2 x</math> to a training set with observations <math>((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))</math> and corresponding estimated responses <math>(\hat y_1, \hat y_2, \ldots, \hat y_n)</math> using [[least squares]]. The objective function to be minimized is
<math display="block">Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(\hat y_i - y_i\right)^2 = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.</math>

For this specific problem, the last line in the above pseudocode becomes:
<math display="block">\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.</math>

Note that in each iteration (update step), the gradient is evaluated only at a single point <math>x_i</math> rather than over the full training set. This is the key difference between stochastic gradient descent and batch gradient descent.

In general, given a linear regression problem <math>\hat y = \sum_{k\in 1:m} w_k x_k</math>, stochastic gradient descent behaves differently when <math>m < n</math> (underparameterized) and <math>m \geq n</math> (overparameterized). In the overparameterized case, stochastic gradient descent converges to <math>\arg\min_{w:\, w^T x_k = y_k \,\forall k \in 1:n} \|w - w_0\|</math>; that is, SGD converges to the interpolating solution with minimum distance from the starting point <math>w_0</math>. This holds even when the learning rate remains constant. In the underparameterized case, SGD does not converge if the learning rate remains constant.<ref>{{Cite journal |last=Belkin |first=Mikhail |date=May 2021 |title=Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation |url=https://www.cambridge.org/core/journals/acta-numerica/article/abs/fit-without-fear-remarkable-mathematical-phenomena-of-deep-learning-through-the-prism-of-interpolation/DBAC769EB7F4DBA5C4720932C2826014 |journal=Acta Numerica |language=en |volume=30 |pages=203–248 |doi=10.1017/S0962492921000039 |arxiv=2105.14368 |issn=0962-4929}}</ref>
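The per-sample update above can be illustrated with a short program. The following Python sketch is a minimal illustration only: the synthetic data, the initial parameters, the number of passes, and the constant learning rate <math>\eta = 0.05</math> are assumed values chosen for demonstration, not part of the algorithm itself.

<syntaxhighlight lang="python">
import random

# Illustrative training data (assumed): noisy samples near the line y = 1 + 2x.
data = [(x, 1.0 + 2.0 * x + random.gauss(0.0, 0.1))
        for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w1, w2 = 0.0, 0.0   # starting point w_0
eta = 0.05          # constant learning rate (assumed value)

for epoch in range(200):
    random.shuffle(data)                 # visit samples in random order each pass
    for x_i, y_i in data:
        error = (w1 + w2 * x_i) - y_i    # residual for a single sample
        # Per-sample update: gradient of (w1 + w2*x_i - y_i)^2 with respect to (w1, w2)
        w1 -= eta * 2 * error
        w2 -= eta * 2 * error * x_i

print(w1, w2)  # approaches the least-squares coefficients (about 1 and 2 here)
</syntaxhighlight>

Each inner-loop step uses the gradient of a single summand <math>Q_i(w)</math>, mirroring the update rule above; shuffling the samples between passes is the common way of drawing them in random order.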