===AdaGrad===
''AdaGrad'' (for adaptive [[Gradient descent|gradient]] algorithm) is a modified stochastic gradient descent algorithm with a per-parameter [[learning rate]], first published in 2011.<ref name="duchi">{{cite journal |last1=Duchi |first1=John |first2=Elad |last2=Hazan |first3=Yoram |last3=Singer |title=Adaptive subgradient methods for online learning and stochastic optimization |journal=[[Journal of Machine Learning Research|JMLR]] |volume=12 |year=2011 |pages=2121–2159 |url=http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf}}</ref> Informally, parameters whose accumulated gradients are small (for example, parameters associated with infrequent, sparse features) retain a comparatively large effective learning rate, while parameters that receive frequent or large gradient updates have their effective learning rate reduced. This strategy often improves convergence over standard stochastic gradient descent in settings where the data is sparse and the sparse parameters are more informative, such as natural language processing and image recognition.<ref name="duchi"/>

AdaGrad still uses a base learning rate {{mvar|η}}, but it is multiplied with the elements of a vector {{math|{''G''<sub>''j'',''j''</sub>} }}, the diagonal of the [[outer product]] matrix
<math display="block">G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T},</math>
where <math>g_\tau = \nabla Q_i(w)</math> is the gradient at iteration {{mvar|τ}}. The diagonal is given by
<math display="block">G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2.</math>
This vector stores a running sum of squared gradients per dimension and is updated after every iteration. The update rule is then{{efn|<math>\odot</math> denotes the [[Hadamard product (matrices)|element-wise product]].}}
<math display="block">w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \odot g</math>
or, written as per-parameter updates,
<math display="block">w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.</math>
Each {{math|''G''<sub>''j'',''j''</sub>}} gives rise to a scaling factor for the learning rate that applies to the single parameter {{math|''w''<sub>''j''</sub>}}. Since the denominator in this factor, <math display="inline">\sqrt{G_{j,j}} = \sqrt{\sum_{\tau=1}^t g_{\tau,j}^2}</math>, is the [[Norm (mathematics)#Euclidean norm|''ℓ''<sub>2</sub> norm]] of the previous partial derivatives, parameters that have received large updates are dampened, while parameters that get few or small updates receive higher effective learning rates.<ref name="Zeiler 2012"/> While designed for [[convex optimization|convex problems]], AdaGrad has been successfully applied to non-convex optimization.<ref>{{cite journal |last1=Gupta |first1=Maya R. |first2=Samy |last2=Bengio |first3=Jason |last3=Weston |title=Training highly multiclass classifiers |journal=JMLR |volume=15 |issue=1 |year=2014 |pages=1461–1492 |url=http://jmlr.org/papers/volume15/gupta14a/gupta14a.pdf}}</ref>
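The per-parameter update above can be sketched in a few lines of NumPy. The following is an illustrative sketch, not a reference implementation: the toy least-squares objective, the function names, and the learning rate are chosen for the example, and the small constant {{mvar|ε}} added to the denominator for numerical stability is a common implementation detail that does not appear in the formulas above.

<syntaxhighlight lang="python">
import numpy as np

# Toy per-example objective Q_i(w) = 0.5 * (x_i . w - y_i)^2 (illustrative choice)
def grad_Qi(w, x_i, y_i):
    return (x_i @ w - y_i) * x_i

def adagrad(w, data, eta=0.5, eps=1e-8, epochs=200):
    """Minimal AdaGrad loop: keep a per-parameter accumulator G of squared gradients."""
    G = np.zeros_like(w)                       # diagonal of sum_tau g_tau g_tau^T
    for _ in range(epochs):
        for x_i, y_i in data:
            g = grad_Qi(w, x_i, y_i)           # g_tau = grad Q_i(w)
            G += g ** 2                        # G_{j,j} += g_{tau,j}^2
            w -= eta * g / (np.sqrt(G) + eps)  # w_j -= eta / sqrt(G_{j,j}) * g_j
    return w

# Example: two noiseless samples of y = 3*x1 - 2*x2; w should approach (3, -2)
data = [(np.array([1.0, 0.0]), 3.0), (np.array([0.0, 1.0]), -2.0)]
print(adagrad(np.zeros(2), data))
</syntaxhighlight>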