===Implicit updates (ISGD)===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the [[learning rate]] {{mvar|η}}. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved<ref>{{cite journal |last1=Toulis |first1=Panos |first2=Edoardo |last2=Airoldi |title=Asymptotic and finite-sample properties of estimators based on stochastic gradients |journal=Annals of Statistics |volume=45 |issue=4 |year=2017 |pages=1694–1727 |doi=10.1214/16-AOS1506 |arxiv=1408.2923 |s2cid=10279395}}</ref> by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
<math display="block">w^\text{new} := w^\text{old} - \eta\, \nabla Q_i(w^\text{new}).</math>

This equation is implicit since <math>w^\text{new}</math> appears on both sides. It is a stochastic form of the [[proximal gradient method]], since the update can also be written as
<math display="block">w^\text{new} := \arg\min_w \left\{ Q_i(w) + \frac{1}{2\eta} \left\|w - w^\text{old}\right\|^2 \right\}.</math>

As an example, consider least squares with features <math>x_1, \ldots, x_n \in\mathbb{R}^p</math> and observations <math>y_1, \ldots, y_n\in\mathbb{R}</math>. We wish to solve
<math display="block">\min_w \sum_{j=1}^n \left(y_j - x_j'w\right)^2,</math>
where <math>x_j' w = x_{j,1} w_1 + x_{j,2} w_2 + \cdots + x_{j,p} w_p</math> denotes the inner product. Note that <math>x</math> could have "1" as its first element to include an intercept. Classical stochastic gradient descent proceeds as follows:
<math display="block">w^\text{new} = w^\text{old} + \eta \left(y_i - x_i'w^\text{old}\right) x_i,</math>
where <math>i</math> is uniformly sampled between 1 and <math>n</math>. Although this procedure converges under relatively mild theoretical assumptions, in practice it can be quite unstable. In particular, when <math>\eta</math> is misspecified so that <math>I - \eta x_i x_i'</math> has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, ''implicit stochastic gradient descent'' (shortened as ISGD) can be solved in closed form as
<math display="block">w^\text{new} = w^\text{old} + \frac{\eta}{1 + \eta \left\|x_i\right\|^2} \left(y_i - x_i'w^\text{old}\right) x_i.</math>

This procedure remains numerically stable for virtually all <math>\eta</math>, as the [[learning rate]] is now normalized. The comparison between classical and implicit stochastic gradient descent in the least squares problem closely parallels the comparison between [[Least mean squares filter|least mean squares (LMS)]] and the [[Least mean squares filter#Normalized least mean squares filter (NLMS)|normalized least mean squares filter (NLMS)]].
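The following is a minimal sketch (in Python) of the closed-form least-squares ISGD update above; the function name, fixed step count, and default learning rate are illustrative choices rather than part of the method itself.
<syntaxhighlight lang="python">
import numpy as np

def isgd_least_squares(X, y, eta=0.1, n_steps=1000, seed=0):
    """Implicit SGD for least squares using the closed-form normalized update."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        i = rng.integers(n)                  # sample an observation uniformly at random
        x_i, y_i = X[i], y[i]
        residual = y_i - x_i @ w             # y_i - x_i' w_old
        # implicit update: effective step size is eta / (1 + eta * ||x_i||^2)
        w = w + (eta / (1.0 + eta * (x_i @ x_i))) * residual * x_i
    return w
</syntaxhighlight>
Note that the effective step size <math>\eta/(1+\eta\|x_i\|^2)</math> is bounded above by <math>1/\|x_i\|^2</math>, which reflects the normalization mentioned above.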
Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that <math>Q_i(w)</math> depends on <math>w</math> only through a linear combination with features <math>x_i</math>, so that we can write <math>\nabla_w Q_i(w) = -q(x_i'w) x_i</math>, where <math>q() \in\mathbb{R}</math> may depend on <math>x_i, y_i</math> as well, but not on <math>w</math> except through <math>x_i'w</math>. Least squares obeys this rule, and so do [[logistic regression]] and most [[generalized linear model]]s. For instance, in least squares, <math>q(x_i'w) = y_i - x_i'w</math>, and in logistic regression <math>q(x_i'w) = y_i - S(x_i'w)</math>, where <math>S(u) = e^u/(1+e^u)</math> is the [[logistic function]]. In [[Poisson regression]], <math>q(x_i'w) = y_i - e^{x_i'w}</math>, and so on.

In such settings, ISGD is implemented as follows. Let <math>f(\xi) = \eta q(x_i'w^\text{old} + \xi \|x_i\|^2)</math>, where <math>\xi</math> is a scalar. Then ISGD is equivalent to
<math display="block">w^\text{new} = w^\text{old} + \xi^\ast x_i,~\text{where}~\xi^\ast = f(\xi^\ast).</math>
The scaling factor <math>\xi^\ast\in\mathbb{R}</math> can be found through the [[bisection method]]: in most regular models, such as the aforementioned generalized linear models, the function <math>q()</math> is decreasing, so the search bounds for <math>\xi^\ast</math> are <math>[\min(0, f(0)), \max(0, f(0))]</math>.
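A sketch of a single ISGD step under this assumption is shown below (in Python). The bisection tolerance, iteration cap, and function name are illustrative choices, and <code>q</code> is assumed to be decreasing, as described above, so that <math>\xi - f(\xi)</math> is increasing and the fixed point can be bracketed by bisection.
<syntaxhighlight lang="python">
import numpy as np

def isgd_step(w_old, x_i, q, eta, tol=1e-10, max_iter=200):
    """One implicit SGD step for a model with gradient -q(x_i' w) * x_i.

    q is a scalar function of the linear predictor; e.g. for logistic
    regression with response y_i, q(u) = y_i - 1.0 / (1.0 + np.exp(-u)).
    q is assumed to be decreasing, as in the generalized linear models above.
    """
    norm_sq = float(x_i @ x_i)
    f = lambda xi: eta * q(float(x_i @ w_old) + xi * norm_sq)
    # the fixed point xi* = f(xi*) lies in [min(0, f(0)), max(0, f(0))]
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        # g(xi) = xi - f(xi) is increasing because f is decreasing
        if mid - f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    xi_star = 0.5 * (lo + hi)
    return w_old + xi_star * x_i
</syntaxhighlight>
A full ISGD run repeats this step over randomly sampled observations, redefining <code>q</code> for each sampled pair <math>(x_i, y_i)</math>.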