===AdaGrad===
''AdaGrad'' (for adaptive [[Gradient descent|gradient]] algorithm) is a modified stochastic gradient descent algorithm with a per-parameter [[learning rate]], first published in 2011.<ref name="duchi">{{cite journal |last1=Duchi |first1=John |first2=Elad |last2=Hazan |first3=Yoram |last3=Singer |title=Adaptive subgradient methods for online learning and stochastic optimization |journal=[[Journal of Machine Learning Research|JMLR]] |volume=12 |year=2011 |pages=2121–2159 |url=http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf}}</ref> Informally, parameters whose accumulated gradients are small (for example, parameters associated with infrequent, sparse features) retain a comparatively large effective learning rate, while parameters that receive frequent or large gradient updates have their effective learning rate reduced. This strategy often improves convergence over standard stochastic gradient descent in settings where the data is sparse and the sparse parameters are more informative, such as natural language processing and image recognition.<ref name="duchi"/>

AdaGrad still uses a base learning rate {{mvar|η}}, but it is multiplied with the elements of a vector {{math|{''G''<sub>''j'',''j''</sub>} }}, the diagonal of the [[outer product]] matrix
<math display="block">G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T},</math>
where <math>g_\tau = \nabla Q_i(w)</math> is the gradient at iteration {{mvar|τ}}. The diagonal is given by
<math display="block">G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2.</math>
This vector stores a running sum of squared gradients per dimension and is updated after every iteration. The update rule is then{{efn|<math>\odot</math> denotes the [[Hadamard product (matrices)|element-wise product]].}}
<math display="block">w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \odot g</math>
or, written as per-parameter updates,
<math display="block">w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.</math>
Each {{math|''G''<sub>''j'',''j''</sub>}} gives rise to a scaling factor for the learning rate that applies to the single parameter {{math|''w''<sub>''j''</sub>}}. Since the denominator in this factor, <math display="inline">\sqrt{G_{j,j}} = \sqrt{\sum_{\tau=1}^t g_{\tau,j}^2}</math>, is the [[Norm (mathematics)#Euclidean norm|''ℓ''<sub>2</sub> norm]] of the previous partial derivatives, parameters that have received large updates are dampened, while parameters that get few or small updates receive higher effective learning rates.<ref name="Zeiler 2012"/> While designed for [[convex optimization|convex problems]], AdaGrad has been successfully applied to non-convex optimization.<ref>{{cite journal |last1=Gupta |first1=Maya R. |first2=Samy |last2=Bengio |first3=Jason |last3=Weston |title=Training highly multiclass classifiers |journal=JMLR |volume=15 |issue=1 |year=2014 |pages=1461–1492 |url=http://jmlr.org/papers/volume15/gupta14a/gupta14a.pdf}}</ref>
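The per-parameter update above can be sketched in a few lines of NumPy. The following is an illustrative sketch, not a reference implementation: the toy least-squares objective, the function names, and the learning rate are chosen for the example, and the small constant {{mvar|ε}} added to the denominator for numerical stability is a common implementation detail that does not appear in the formulas above.

<syntaxhighlight lang="python">
import numpy as np

# Toy per-example objective Q_i(w) = 0.5 * (x_i . w - y_i)^2 (illustrative choice)
def grad_Qi(w, x_i, y_i):
    return (x_i @ w - y_i) * x_i

def adagrad(w, data, eta=0.5, eps=1e-8, epochs=200):
    """Minimal AdaGrad loop: keep a per-parameter accumulator G of squared gradients."""
    G = np.zeros_like(w)                       # diagonal of sum_tau g_tau g_tau^T
    for _ in range(epochs):
        for x_i, y_i in data:
            g = grad_Qi(w, x_i, y_i)           # g_tau = grad Q_i(w)
            G += g ** 2                        # G_{j,j} += g_{tau,j}^2
            w -= eta * g / (np.sqrt(G) + eps)  # w_j -= eta / sqrt(G_{j,j}) * g_j
    return w

# Example: two noiseless samples of y = 3*x1 - 2*x2; w should approach (3, -2)
data = [(np.array([1.0, 0.0]), 3.0), (np.array([0.0, 1.0]), -2.0)]
print(adagrad(np.zeros(2), data))
</syntaxhighlight>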