==== Temporal difference methods ====
{{Main|Temporal difference learning}}

The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic, as it might prevent convergence. Most current algorithms do this, giving rise to the class of ''generalized policy iteration'' algorithms. Many [[Actor-critic algorithm|''actor-critic'' methods]] belong to this category.

The second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is Sutton's [[temporal difference]] (TD) methods, which are based on the recursive [[Bellman equation]].<ref>{{cite thesis |last=Sutton |first=Richard S. |title=Temporal Credit Assignment in Reinforcement Learning |degree=PhD |publisher=University of Massachusetts, Amherst, MA |url=http://incompleteideas.net/sutton/publications.html#PhDthesis |author-link=Richard S. Sutton |year=1984 |access-date=2017-03-29 |archive-date=2017-03-30 |archive-url=https://web.archive.org/web/20170330002227/http://incompleteideas.net/sutton/publications.html#PhDthesis |url-status=dead}}</ref>{{sfn|Sutton|Barto|2018|loc=[http://incompleteideas.net/sutton/book/ebook/node60.html §6. Temporal-Difference Learning]}} The computation in TD methods can be incremental (after each transition the estimates are updated and the transition is discarded) or batch (the transitions are collected and the estimates are computed once from the whole batch). Batch methods, such as the least-squares temporal difference method,<ref>{{cite journal |doi=10.1023/A:1018056104778 |last1=Bradtke |first1=Steven J. |last2=Barto |first2=Andrew G. |author-link2=Andrew G. Barto |title=Linear Least-Squares Algorithms for Temporal Difference Learning |journal=Machine Learning |volume=22 |pages=33–57 |year=1996 |citeseerx=10.1.1.143.857 |s2cid=20327856}}</ref> may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue.

Another problem specific to TD methods stems from their reliance on the recursive Bellman equation. Most TD methods have a so-called <math>\lambda</math> parameter <math>(0\le \lambda\le 1)</math> that can continuously interpolate between Monte Carlo methods, which do not rely on the Bellman equations, and the basic TD methods, which rely entirely on the Bellman equations. This can be effective in mitigating this issue.
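As an illustration of the incremental TD update and of the role of the <math>\lambda</math> parameter, the following is a minimal sketch of tabular TD(<math>\lambda</math>) policy evaluation with accumulating eligibility traces. The five-state random-walk environment, the step size, and the trace-decay value used here are illustrative assumptions rather than part of any particular reference implementation.

<syntaxhighlight lang="python">
import random

# Minimal tabular TD(lambda) policy evaluation on a hypothetical 5-state random walk.
# States 0..4; episodes start in state 2; stepping left of 0 or right of 4 terminates.
# Reward is +1 only when terminating on the right; the exact state values are 1/6, ..., 5/6.

NUM_STATES = 5
ALPHA = 0.1     # step size (illustrative choice)
GAMMA = 1.0     # no discounting in this episodic task
LAMBDA = 0.8    # trace decay: 0 gives one-step TD(0), values near 1 approach Monte Carlo

def run_episode(values):
    """One episode of incremental TD(lambda) with accumulating eligibility traces."""
    traces = [0.0] * NUM_STATES
    state = 2
    while True:
        next_state = state + random.choice([-1, 1])
        if next_state < 0:
            reward, next_value, done = 0.0, 0.0, True
        elif next_state >= NUM_STATES:
            reward, next_value, done = 1.0, 0.0, True
        else:
            reward, next_value, done = 0.0, values[next_state], False

        # TD error from the one-step Bellman relation: r + gamma * V(s') - V(s)
        td_error = reward + GAMMA * next_value - values[state]

        # Update every state in proportion to its eligibility, then decay the traces;
        # the transition itself is discarded, so memory use stays constant per step.
        traces[state] += 1.0
        for s in range(NUM_STATES):
            values[s] += ALPHA * td_error * traces[s]
            traces[s] *= GAMMA * LAMBDA

        if done:
            return
        state = next_state

values = [0.5] * NUM_STATES
for _ in range(1000):
    run_episode(values)
print([round(v, 2) for v in values])
</syntaxhighlight>

Setting <code>LAMBDA = 0</code> recovers the basic one-step TD update, which relies entirely on the Bellman equation, while values close to 1 spread each error over the whole preceding trajectory and so behave more like a Monte Carlo estimate.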