== Variants ==

=== Deep Q-learning ===
The DeepMind system used a deep [[convolutional neural network]], with layers of tiled [[convolution]]al filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications.<ref name="Li-2023"/><ref name="MBK">{{Cite journal |author1=Matzliach B. |author2=Ben-Gal I. |author3=Kagan E. |title=Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities |journal=Entropy |year=2022 |volume=24 |issue=8 |page=1168 |url=http://www.eng.tau.ac.il/~bengal/DeepQ_MBK_2023.pdf |doi=10.3390/e24081168 |pmid=36010832 |pmc=9407070 |bibcode=2022Entrp..24.1168M |doi-access=free}}</ref>

The technique used ''experience replay'', a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed.<ref name=":0" /> This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.<ref name="DQN">{{Cite journal |last1=Mnih |first1=Volodymyr |last2=Kavukcuoglu |first2=Koray |last3=Silver |first3=David |last4=Rusu |first4=Andrei A. |last5=Veness |first5=Joel |last6=Bellemare |first6=Marc G. |last7=Graves |first7=Alex |last8=Riedmiller |first8=Martin |last9=Fidjeland |first9=Andreas K. |date=Feb 2015 |title=Human-level control through deep reinforcement learning |journal=Nature |language=en |volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236 |pmid=25719670 |bibcode=2015Natur.518..529M |s2cid=205242740 |issn=0028-0836}}</ref>
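The following is a minimal, illustrative sketch of the two stabilising mechanisms described above (a replay memory sampled at random and a periodically synchronised set of target parameters). It uses a toy linear approximator and hypothetical helper names (<code>q_values</code>, <code>train_step</code>, <code>sync_target</code>) rather than the deep network of the original DQN system:

<syntaxhighlight lang="python">
import random
from collections import deque

import numpy as np

# Toy setup (illustrative only): states are feature vectors and Q is a
# linear approximator; the focus is on the two stabilising mechanisms.
N_FEATURES, N_ACTIONS = 4, 2
GAMMA, ALPHA = 0.99, 0.01

rng = np.random.default_rng(0)
theta = 0.01 * rng.standard_normal((N_ACTIONS, N_FEATURES))  # online parameters
theta_target = theta.copy()                                  # target parameters
replay_buffer = deque(maxlen=10_000)                         # experience replay memory


def q_values(params, state):
    """Q(s, .) under the given parameter set (here a simple linear model)."""
    return params @ state


def store(state, action, reward, next_state, done):
    """Add one transition to the replay memory."""
    replay_buffer.append((state, action, reward, next_state, done))


def train_step(batch_size=32):
    """One update from a random minibatch, bootstrapping from the target parameters."""
    if len(replay_buffer) < batch_size:
        return
    # Sampling at random, rather than using the most recent transition,
    # breaks the temporal correlations in the observation sequence.
    batch = random.sample(list(replay_buffer), batch_size)
    for state, action, reward, next_state, done in batch:
        target = reward
        if not done:
            target += GAMMA * np.max(q_values(theta_target, next_state))
        td_error = target - q_values(theta, state)[action]
        theta[action] += ALPHA * td_error * state            # semi-gradient update


def sync_target():
    """Periodically copy the online parameters into the target parameters."""
    theta_target[:] = theta
</syntaxhighlight>

A full agent would typically call <code>train_step</code> after every environment step but <code>sync_target</code> only every few thousand steps, matching the periodic target updates described above.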
=== Double Q-learning ===
Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning<ref>{{Cite journal |last=van Hasselt |first=Hado |year=2011 |title=Double Q-learning |url=http://papers.nips.cc/paper/3964-double-q-learning |format=PDF |journal=Advances in Neural Information Processing Systems |volume=23 |pages=2613–2622}}</ref> is an [[off-policy]] reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.

In practice, two separate value functions <math>Q^A</math> and <math>Q^B</math> are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

:<math>Q^A_{t+1}(s_{t}, a_{t}) = Q^A_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^B_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_{t}(s_{t}, a_{t})\right)</math>, and
:<math>Q^B_{t+1}(s_{t}, a_{t}) = Q^B_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^A_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_{t}(s_{t}, a_{t})\right).</math>

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.

This algorithm was later modified in 2015 and combined with [[deep learning]],<ref>{{cite arXiv |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |title=Deep Reinforcement Learning with Double Q-learning |date=8 December 2015 |class=cs.LG |eprint=1509.06461}}</ref> as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.<ref>{{Cite journal |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |date=2015 |title=Deep reinforcement learning with double Q-learning |url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847 |format=PDF |journal=AAAI Conference on Artificial Intelligence |pages=2094–2100 |arxiv=1509.06461}}</ref>
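A minimal tabular sketch of this update rule (with illustrative constants, and terminal-state handling omitted) might look as follows:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative constants; real values depend on the problem.
N_STATES, N_ACTIONS = 10, 4
ALPHA, GAMMA = 0.1, 0.95

rng = np.random.default_rng(0)
Q_A = np.zeros((N_STATES, N_ACTIONS))
Q_B = np.zeros((N_STATES, N_ACTIONS))


def double_q_update(s, a, r, s_next):
    """Update one randomly chosen table: the action is selected with that
    table's arg max but evaluated with the other table, as in the equations above."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))               # arg max under Q^A
        Q_A[s, a] += ALPHA * (r + GAMMA * Q_B[s_next, a_star] - Q_A[s, a])
    else:
        b_star = int(np.argmax(Q_B[s_next]))               # arg max under Q^B
        Q_B[s, a] += ALPHA * (r + GAMMA * Q_A[s_next, b_star] - Q_B[s, a])
</syntaxhighlight>

During interaction, actions can be chosen (for example, ε-greedily) from a combination of the two estimates, such as their average.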
=== Others ===
Delayed Q-learning is an alternative implementation of the online ''Q''-learning algorithm, with [[Probably approximately correct learning|probably approximately correct (PAC) learning]].<ref>{{Cite journal |last1=Strehl |first1=Alexander L. |last2=Li |first2=Lihong |last3=Wiewiora |first3=Eric |last4=Langford |first4=John |last5=Littman |first5=Michael L. |year=2006 |title=Pac model-free reinforcement learning |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/published-14.pdf |journal=Proc. 22nd ICML |pages=881–888}}</ref>

Greedy GQ is a variant of ''Q''-learning for use in combination with (linear) function approximation.<ref>{{cite web |first1=Hamid |last1=Maei |first2=Csaba |last2=Szepesvári |first3=Shalabh |last3=Bhatnagar |first4=Richard |last4=Sutton |url=https://webdocs.cs.ualberta.ca/~sutton/papers/MSBS-10.pdf |title=Toward off-policy learning control with function approximation |work=Proceedings of the 27th International Conference on Machine Learning |pages=719–726 |year=2010 |access-date=2016-01-25 |archive-url=https://web.archive.org/web/20120908050052/http://webdocs.cs.ualberta.ca/~sutton/papers/MSBS-10.pdf |archive-date=2012-09-08 |url-status=dead}}</ref> The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

Distributional Q-learning is a variant of ''Q''-learning which seeks to model the distribution of returns rather than the expected return of each action. It has been observed to facilitate estimation by deep neural networks and can enable alternative control methods, such as risk-sensitive control (see the illustrative sketch at the end of this section).<ref>{{cite journal |last1=Hessel |first1=Matteo |last2=Modayil |first2=Joseph |last3=van Hasselt |first3=Hado |last4=Schaul |first4=Tom |last5=Ostrovski |first5=Georg |last6=Dabney |first6=Will |last7=Horgan |first7=Dan |last8=Piot |first8=Bilal |last9=Azar |first9=Mohammad |last10=Silver |first10=David |title=Rainbow: Combining Improvements in Deep Reinforcement Learning |journal=Proceedings of the AAAI Conference on Artificial Intelligence |date=February 2018 |volume=32 |doi=10.1609/aaai.v32i1.11796 |arxiv=1710.02298 |s2cid=19135734}}</ref>

=== Multi-agent learning ===
Q-learning has been proposed in the multi-agent setting (see Section 4.1.2 in <ref>{{cite journal |last1=Shoham |first1=Yoav |last2=Powers |first2=Rob |last3=Grenager |first3=Trond |title=If multi-agent learning is the answer, what is the question? |journal=Artificial Intelligence |date=1 May 2007 |volume=171 |issue=7 |pages=365–377 |doi=10.1016/j.artint.2006.02.006 |url=https://dl.acm.org/doi/10.1016/j.artint.2006.02.006 |access-date=4 April 2023 |issn=0004-3702}}</ref>). One approach consists in pretending the environment is passive.<ref>{{cite journal |last1=Sen |first1=Sandip |last2=Sekaran |first2=Mahendra |last3=Hale |first3=John |title=Learning to coordinate without sharing information |journal=Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence |date=1 August 1994 |pages=426–431 |url=https://dl.acm.org/doi/10.5555/2891730.2891796 |access-date=4 April 2023 |publisher=AAAI Press}}</ref> Littman proposes the minimax Q-learning algorithm.<ref>{{cite journal |last1=Littman |first1=Michael L. |title=Markov games as a framework for multi-agent reinforcement learning |journal=Proceedings of the Eleventh International Conference on International Conference on Machine Learning |date=10 July 1994 |pages=157–163 |url=https://dl.acm.org/doi/10.5555/3091574.3091594 |access-date=4 April 2023 |publisher=Morgan Kaufmann Publishers Inc. |isbn=9781558603356}}</ref>
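The distributional idea mentioned under ''Others'' can be illustrated with a minimal sketch: instead of a single scalar <math>Q(s, a)</math>, the agent maintains a distribution over returns, from which both the usual expectation and risk-sensitive statistics can be read. The atom grid and probabilities below are placeholders, not learned values, and no learning algorithm is shown:

<syntaxhighlight lang="python">
import numpy as np

# Categorical return distribution for a single (state, action) pair.
# The atom grid and the probabilities are illustrative placeholders.
atoms = np.linspace(-10.0, 10.0, 51)            # support of the return distribution
probs = np.full(atoms.size, 1.0 / atoms.size)   # probabilities over the atoms (sum to 1)

# Ordinary Q-learning would keep only this expectation:
expected_return = float(np.dot(probs, atoms))

# A risk-sensitive controller can instead use other statistics of the same
# distribution, for example the probability of a return below some threshold:
risk_of_low_return = float(probs[atoms < -5.0].sum())
</syntaxhighlight>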