==== Temporal difference methods ====
{{Main|Temporal difference learning}}

The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic, as it might prevent convergence. Most current algorithms do this, giving rise to the class of ''generalized policy iteration'' algorithms. Many [[Actor-critic algorithm|''actor-critic'' methods]] belong to this category.

The second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is Sutton's [[temporal difference]] (TD) methods, which are based on the recursive [[Bellman equation]].<ref>{{cite thesis |last=Sutton |first=Richard S. |title=Temporal Credit Assignment in Reinforcement Learning |degree=PhD |publisher=University of Massachusetts, Amherst, MA |url=http://incompleteideas.net/sutton/publications.html#PhDthesis |author-link=Richard S. Sutton |year=1984 |access-date=2017-03-29 |archive-date=2017-03-30 |archive-url=https://web.archive.org/web/20170330002227/http://incompleteideas.net/sutton/publications.html#PhDthesis |url-status=dead}}</ref>{{sfn|Sutton|Barto|2018|loc=[http://incompleteideas.net/sutton/book/ebook/node60.html §6. Temporal-Difference Learning]}} The computation in TD methods can be incremental (after each transition the estimates are updated and the transition is discarded) or batch (the transitions are collected and the estimates are computed once from the whole batch). Batch methods, such as the least-squares temporal difference method,<ref>{{cite journal |doi=10.1023/A:1018056104778 |last1=Bradtke |first1=Steven J. |last2=Barto |first2=Andrew G. |author-link2=Andrew G. Barto |title=Linear Least-Squares Algorithms for Temporal Difference Learning |journal=Machine Learning |volume=22 |pages=33–57 |year=1996 |citeseerx=10.1.1.143.857 |s2cid=20327856}}</ref> may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue.

Another problem specific to TD methods stems from their reliance on the recursive Bellman equation. Most TD methods have a so-called <math>\lambda</math> parameter <math>(0\le \lambda\le 1)</math> that can continuously interpolate between Monte Carlo methods, which do not rely on the Bellman equations, and the basic TD methods, which rely entirely on the Bellman equations. This can be effective in mitigating this issue.
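As an illustration of the incremental TD update and of the role of the <math>\lambda</math> parameter, the following is a minimal sketch of tabular TD(<math>\lambda</math>) policy evaluation with accumulating eligibility traces. The five-state random-walk environment, the step size, and the trace-decay value used here are illustrative assumptions rather than part of any particular reference implementation.

<syntaxhighlight lang="python">
import random

# Minimal tabular TD(lambda) policy evaluation on a hypothetical 5-state random walk.
# States 0..4; episodes start in state 2; stepping left of 0 or right of 4 terminates.
# Reward is +1 only when terminating on the right; the exact state values are 1/6, ..., 5/6.

NUM_STATES = 5
ALPHA = 0.1     # step size (illustrative choice)
GAMMA = 1.0     # no discounting in this episodic task
LAMBDA = 0.8    # trace decay: 0 gives one-step TD(0), values near 1 approach Monte Carlo

def run_episode(values):
    """One episode of incremental TD(lambda) with accumulating eligibility traces."""
    traces = [0.0] * NUM_STATES
    state = 2
    while True:
        next_state = state + random.choice([-1, 1])
        if next_state < 0:
            reward, next_value, done = 0.0, 0.0, True
        elif next_state >= NUM_STATES:
            reward, next_value, done = 1.0, 0.0, True
        else:
            reward, next_value, done = 0.0, values[next_state], False

        # TD error from the one-step Bellman relation: r + gamma * V(s') - V(s)
        td_error = reward + GAMMA * next_value - values[state]

        # Update every state in proportion to its eligibility, then decay the traces;
        # the transition itself is discarded, so memory use stays constant per step.
        traces[state] += 1.0
        for s in range(NUM_STATES):
            values[s] += ALPHA * td_error * traces[s]
            traces[s] *= GAMMA * LAMBDA

        if done:
            return
        state = next_state

values = [0.5] * NUM_STATES
for _ in range(1000):
    run_episode(values)
print([round(v, 2) for v in values])
</syntaxhighlight>

Setting <code>LAMBDA = 0</code> recovers the basic one-step TD update, which relies entirely on the Bellman equation, while values close to 1 spread each error over the whole preceding trajectory and so behave more like a Monte Carlo estimate.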