===Implicit updates (ISGD)===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the [[learning rate]] {{mvar|η}}. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved<ref>{{cite journal |last1=Toulis |first1=Panos |first2=Edoardo |last2=Airoldi |title=Asymptotic and finite-sample properties of estimators based on stochastic gradients |journal=Annals of Statistics |volume=45 |issue=4 |year=2017 |pages=1694–1727 |doi=10.1214/16-AOS1506 |arxiv=1408.2923 |s2cid=10279395}}</ref> by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
<math display="block">w^\text{new} := w^\text{old} - \eta\, \nabla Q_i(w^\text{new}).</math>

This equation is implicit since <math>w^\text{new}</math> appears on both sides. It is a stochastic form of the [[proximal gradient method]], since the update can also be written as
<math display="block">w^\text{new} := \arg\min_w \left\{ Q_i(w) + \frac{1}{2\eta} \left\|w - w^\text{old}\right\|^2 \right\}.</math>

As an example, consider least squares with features <math>x_1, \ldots, x_n \in\mathbb{R}^p</math> and observations <math>y_1, \ldots, y_n\in\mathbb{R}</math>. We wish to solve
<math display="block">\min_w \sum_{j=1}^n \left(y_j - x_j'w\right)^2,</math>
where <math>x_j' w = x_{j,1} w_1 + x_{j,2} w_2 + \cdots + x_{j,p} w_p</math> denotes the inner product. Note that <math>x</math> could have "1" as its first element to include an intercept. Classical stochastic gradient descent proceeds as follows:
<math display="block">w^\text{new} = w^\text{old} + \eta \left(y_i - x_i'w^\text{old}\right) x_i,</math>
where <math>i</math> is uniformly sampled between 1 and <math>n</math>. Although this procedure converges under relatively mild theoretical assumptions, in practice it can be quite unstable. In particular, when <math>\eta</math> is misspecified so that <math>I - \eta x_i x_i'</math> has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, ''implicit stochastic gradient descent'' (shortened as ISGD) can be solved in closed form as
<math display="block">w^\text{new} = w^\text{old} + \frac{\eta}{1 + \eta \left\|x_i\right\|^2} \left(y_i - x_i'w^\text{old}\right) x_i.</math>

This procedure remains numerically stable for virtually all <math>\eta</math>, as the [[learning rate]] is now normalized. The comparison between classical and implicit stochastic gradient descent in the least squares problem closely parallels the comparison between [[Least mean squares filter|least mean squares (LMS)]] and the [[Least mean squares filter#Normalized least mean squares filter (NLMS)|normalized least mean squares filter (NLMS)]].
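The following is a minimal sketch (in Python) of the closed-form least-squares ISGD update above; the function name, fixed step count, and default learning rate are illustrative choices rather than part of the method itself.
<syntaxhighlight lang="python">
import numpy as np

def isgd_least_squares(X, y, eta=0.1, n_steps=1000, seed=0):
    """Implicit SGD for least squares using the closed-form normalized update."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        i = rng.integers(n)                  # sample an observation uniformly at random
        x_i, y_i = X[i], y[i]
        residual = y_i - x_i @ w             # y_i - x_i' w_old
        # implicit update: effective step size is eta / (1 + eta * ||x_i||^2)
        w = w + (eta / (1.0 + eta * (x_i @ x_i))) * residual * x_i
    return w
</syntaxhighlight>
Note that the effective step size <math>\eta/(1+\eta\|x_i\|^2)</math> is bounded above by <math>1/\|x_i\|^2</math>, which reflects the normalization mentioned above.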
Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that <math>Q_i(w)</math> depends on <math>w</math> only through a linear combination with features <math>x_i</math>, so that we can write <math>\nabla_w Q_i(w) = -q(x_i'w) x_i</math>, where <math>q() \in\mathbb{R}</math> may depend on <math>x_i, y_i</math> as well, but not on <math>w</math> except through <math>x_i'w</math>. Least squares obeys this rule, and so do [[logistic regression]] and most [[generalized linear model]]s. For instance, in least squares, <math>q(x_i'w) = y_i - x_i'w</math>, and in logistic regression <math>q(x_i'w) = y_i - S(x_i'w)</math>, where <math>S(u) = e^u/(1+e^u)</math> is the [[logistic function]]. In [[Poisson regression]], <math>q(x_i'w) = y_i - e^{x_i'w}</math>, and so on.

In such settings, ISGD is implemented as follows. Let <math>f(\xi) = \eta q(x_i'w^\text{old} + \xi \|x_i\|^2)</math>, where <math>\xi</math> is a scalar. Then ISGD is equivalent to
<math display="block">w^\text{new} = w^\text{old} + \xi^\ast x_i,~\text{where}~\xi^\ast = f(\xi^\ast).</math>
The scaling factor <math>\xi^\ast\in\mathbb{R}</math> can be found through the [[bisection method]]: in most regular models, such as the aforementioned generalized linear models, the function <math>q()</math> is decreasing, so the search bounds for <math>\xi^\ast</math> are <math>[\min(0, f(0)), \max(0, f(0))]</math>.
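A sketch of a single ISGD step under this assumption is shown below (in Python). The bisection tolerance, iteration cap, and function name are illustrative choices, and <code>q</code> is assumed to be decreasing, as described above, so that <math>\xi - f(\xi)</math> is increasing and the fixed point can be bracketed by bisection.
<syntaxhighlight lang="python">
import numpy as np

def isgd_step(w_old, x_i, q, eta, tol=1e-10, max_iter=200):
    """One implicit SGD step for a model with gradient -q(x_i' w) * x_i.

    q is a scalar function of the linear predictor; e.g. for logistic
    regression with response y_i, q(u) = y_i - 1.0 / (1.0 + np.exp(-u)).
    q is assumed to be decreasing, as in the generalized linear models above.
    """
    norm_sq = float(x_i @ x_i)
    f = lambda xi: eta * q(float(x_i @ w_old) + xi * norm_sq)
    # the fixed point xi* = f(xi*) lies in [min(0, f(0)), max(0, f(0))]
    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        # g(xi) = xi - f(xi) is increasing because f is decreasing
        if mid - f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    xi_star = 0.5 * (lo + hi)
    return w_old + xi_star * x_i
</syntaxhighlight>
A full ISGD run repeats this step over randomly sampled observations, redefining <code>q</code> for each sampled pair <math>(x_i, y_i)</math>.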