===RMSProp===
''RMSProp'' (for Root Mean Square Propagation) is a method invented in 2012 by James Martens and [[Ilya Sutskever]], at the time both PhD students in Geoffrey Hinton's group, in which the [[learning rate]] is, as in Adagrad, adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.<ref name=rmsprop>{{Cite web|url=http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf|title=Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude|last=Hinton|first=Geoffrey|author-link=Geoffrey Hinton|pages=26|access-date=19 March 2020}}</ref> Unusually, it was not published in an article but merely described in a [[Coursera]] lecture.<ref>{{Cite web|url=https://deepai.org/machine-learning-glossary-and-terms/rmsprop#:~:text=The%20RMSProp%20algorithm%20was%20introduced,its%20effectiveness%20in%20various%20applications.|title=RMSProp|website=DeepAI Machine Learning Glossary}}</ref><ref>[https://www.youtube.com/watch?v=-eyhCTvrEtE&t=36m37s Video], at 36:37.</ref>

First, the running average is calculated in terms of the mean square,
<math display="block">v(w,t):=\gamma v(w,t-1) + \left(1-\gamma\right) \left(\nabla Q_i(w)\right)^2</math>
where <math>\gamma</math> is the forgetting factor. The concept of storing the historical gradient as a sum of squares is borrowed from Adagrad, but "forgetting" is introduced to solve Adagrad's diminishing learning rates in non-convex problems by gradually decreasing the influence of old data.{{cn|date=June 2024}}

The parameters are then updated as
<math display="block">w:=w-\frac{\eta}{\sqrt{v(w,t)}}\nabla Q_i(w).</math>

RMSProp has shown good adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of [[Rprop]] and is capable of working with mini-batches as well, as opposed to only full batches.<ref name="rmsprop" />
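As an illustration, the two update rules above can be written as a short routine. The following is a minimal NumPy sketch, not taken from the cited lecture: the small constant <code>eps</code> is a common numerical-stability addition that does not appear in the formulas, and the toy quadratic objective in the usage example is chosen here purely for demonstration.

<syntaxhighlight lang="python">
import numpy as np

def rmsprop_update(w, v, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step for parameters w given the gradient grad.

    v is the running average of squared gradients, gamma the forgetting
    factor, and eta the learning rate, matching the formulas above.
    eps is a small constant commonly added for numerical stability
    (not part of the formulas as written).
    """
    v = gamma * v + (1.0 - gamma) * grad ** 2
    w = w - eta / (np.sqrt(v) + eps) * grad
    return w, v

# Usage on a toy objective Q(w) = ||w||^2 / 2, whose gradient is simply w
# (standing in for the stochastic gradient \nabla Q_i(w)).
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(1000):
    grad = w                      # gradient of the toy objective
    w, v = rmsprop_update(w, v, grad, eta=0.01)
print(w)                          # ends close to the minimizer [0, 0]
</syntaxhighlight>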