==Background==
{{Main|M-estimation}}
{{See also|Estimating equation}}

Both [[statistics|statistical]] [[M-estimation|estimation]] and [[machine learning]] consider the problem of [[Mathematical optimization|minimizing]] an [[objective function]] that has the form of a sum:
<math display="block">Q(w) = \frac{1}{n}\sum_{i=1}^n Q_i(w),</math>
where the [[parametric statistics|parameter]] <math>w</math> that minimizes <math>Q(w)</math> is to be [[estimator|estimated]]. Each summand function <math>Q_i</math> is typically associated with the <math>i</math>-th [[Observation (statistics)|observation]] in the [[data set]] (used for training).

In classical statistics, sum-minimization problems arise in [[least squares]] and in [[maximum-likelihood estimation]] (for independent observations). The general class of estimators that arise as minimizers of sums are called [[M-estimator]]s. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.<ref>{{cite journal | last = Ferguson | first = Thomas S.|author-link= Thomas S. Ferguson | title = An inconsistent maximum likelihood estimate | journal = Journal of the American Statistical Association | volume = 77 | issue = 380 | year = 1982 | pages = 831–834 | jstor = 2287314 | doi = 10.1080/01621459.1982.10477894 }}</ref> Therefore, contemporary statistical theorists often consider [[stationary point]]s of the [[likelihood function]] (or zeros of its derivative, the [[Score (statistics)|score function]], and other [[estimating equations]]).

The sum-minimization problem also arises for [[empirical risk minimization]]. There, <math>Q_i(w)</math> is the value of the [[loss function]] at the <math>i</math>-th example, and <math>Q(w)</math> is the empirical risk.

When used to minimize the above function, a standard (or "batch") [[gradient descent]] method would perform the following iterations:
<math display="block">w := w - \eta\,\nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^n \nabla Q_i(w).</math>
The step size is denoted by <math>\eta</math> (sometimes called the ''[[learning rate]]'' in machine learning), and here "<math>:=</math>" denotes the update of a variable in the algorithm.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, [[exponential families|one-parameter exponential families]] allow economical function-evaluations and gradient-evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent [[sampling (statistics)|samples]] a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.<ref>{{Cite conference |first1=Léon |last1=Bottou |author1-link=Léon Bottou |last2=Bousquet |first2=Olivier |title=The Tradeoffs of Large Scale Learning |url=http://leon.bottou.org/papers/bottou-bousquet-2008 |conference=[[Advances in Neural Information Processing Systems]] |volume=20 |pages=161–168 |year=2008}}</ref>
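The contrast between the two update rules can be made concrete with a minimal sketch. The example below assumes a least-squares loss <math>Q_i(w) = \tfrac{1}{2}(x_i^\mathsf{T} w - y_i)^2</math>; the synthetic data, step size, minibatch size, and function names are illustrative choices, not part of the formal description above.

<syntaxhighlight lang="python">
# Minimal sketch (not a reference implementation): full-batch gradient descent
# versus stochastic (minibatch) gradient descent for least-squares regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))                      # n observations, d features
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def avg_grad(w, idx):
    """Average of the gradients of Q_i over the observations in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

eta = 0.1                                        # step size (learning rate)
w_batch = np.zeros(d)
w_sgd = np.zeros(d)

for step in range(100):
    # Batch gradient descent: evaluates all n summand gradients per update.
    w_batch -= eta * avg_grad(w_batch, np.arange(n))
    # Stochastic gradient descent: samples a small subset of summands per update.
    minibatch = rng.choice(n, size=10, replace=False)
    w_sgd -= eta * avg_grad(w_sgd, minibatch)
</syntaxhighlight>

Each stochastic step touches only 10 of the 1000 observations, so its per-iteration cost is a small fraction of the batch step's, which is the economy the preceding paragraph describes.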