== Variants ==

=== Deep Q-learning ===
The DeepMind system used a deep [[convolutional neural network]], with layers of tiled [[convolution]]al filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications.<ref name="Li-2023"/><ref name="MBK">{{Cite journal |author1=Matzliach B. |author2=Ben-Gal I. |author3=Kagan E. |title=Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities |journal=Entropy |year=2022 |volume=24 |issue=8 |page=1168 |url=http://www.eng.tau.ac.il/~bengal/DeepQ_MBK_2023.pdf |doi=10.3390/e24081168 |pmid=36010832 |pmc=9407070 |bibcode=2022Entrp..24.1168M |doi-access=free}}</ref>

The technique used ''experience replay'', a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed.<ref name=":0" /> This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target.<ref name="DQN">{{Cite journal |last1=Mnih |first1=Volodymyr |last2=Kavukcuoglu |first2=Koray |last3=Silver |first3=David |last4=Rusu |first4=Andrei A. |last5=Veness |first5=Joel |last6=Bellemare |first6=Marc G. |last7=Graves |first7=Alex |last8=Riedmiller |first8=Martin |last9=Fidjeland |first9=Andreas K. |date=Feb 2015 |title=Human-level control through deep reinforcement learning |journal=Nature |language=en |volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236 |pmid=25719670 |bibcode=2015Natur.518..529M |s2cid=205242740 |issn=0028-0836}}</ref>
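The following is a minimal, illustrative sketch of the two stabilising mechanisms described above (a replay memory sampled at random and a periodically synchronised set of target parameters). It uses a toy linear approximator and hypothetical helper names (<code>q_values</code>, <code>train_step</code>, <code>sync_target</code>) rather than the deep network of the original DQN system:

<syntaxhighlight lang="python">
import random
from collections import deque

import numpy as np

# Toy setup (illustrative only): states are feature vectors and Q is a
# linear approximator; the focus is on the two stabilising mechanisms.
N_FEATURES, N_ACTIONS = 4, 2
GAMMA, ALPHA = 0.99, 0.01

rng = np.random.default_rng(0)
theta = 0.01 * rng.standard_normal((N_ACTIONS, N_FEATURES))  # online parameters
theta_target = theta.copy()                                  # target parameters
replay_buffer = deque(maxlen=10_000)                         # experience replay memory


def q_values(params, state):
    """Q(s, .) under the given parameter set (here a simple linear model)."""
    return params @ state


def store(state, action, reward, next_state, done):
    """Add one transition to the replay memory."""
    replay_buffer.append((state, action, reward, next_state, done))


def train_step(batch_size=32):
    """One update from a random minibatch, bootstrapping from the target parameters."""
    if len(replay_buffer) < batch_size:
        return
    # Sampling at random, rather than using the most recent transition,
    # breaks the temporal correlations in the observation sequence.
    batch = random.sample(list(replay_buffer), batch_size)
    for state, action, reward, next_state, done in batch:
        target = reward
        if not done:
            target += GAMMA * np.max(q_values(theta_target, next_state))
        td_error = target - q_values(theta, state)[action]
        theta[action] += ALPHA * td_error * state            # semi-gradient update


def sync_target():
    """Periodically copy the online parameters into the target parameters."""
    theta_target[:] = theta
</syntaxhighlight>

A full agent would typically call <code>train_step</code> after every environment step but <code>sync_target</code> only every few thousand steps, matching the periodic target updates described above.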
=== Double Q-learning ===
Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. Double Q-learning<ref>{{Cite journal |last=van Hasselt |first=Hado |year=2011 |title=Double Q-learning |url=http://papers.nips.cc/paper/3964-double-q-learning |format=PDF |journal=Advances in Neural Information Processing Systems |volume=23 |pages=2613–2622}}</ref> is an [[off-policy]] reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.

In practice, two separate value functions <math>Q^A</math> and <math>Q^B</math> are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

:<math>Q^A_{t+1}(s_{t}, a_{t}) = Q^A_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^B_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_{t}(s_{t}, a_{t})\right)</math>, and
:<math>Q^B_{t+1}(s_{t}, a_{t}) = Q^B_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^A_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_{t}(s_{t}, a_{t})\right).</math>

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.

This algorithm was later modified in 2015 and combined with [[deep learning]],<ref>{{cite arXiv |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |title=Deep Reinforcement Learning with Double Q-learning |date=8 December 2015 |class=cs.LG |eprint=1509.06461}}</ref> as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.<ref>{{Cite journal |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |date=2015 |title=Deep reinforcement learning with double Q-learning |url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847 |format=PDF |journal=AAAI Conference on Artificial Intelligence |pages=2094–2100 |arxiv=1509.06461}}</ref>
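A minimal tabular sketch of this update rule (with illustrative constants, and terminal-state handling omitted) might look as follows:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative constants; real values depend on the problem.
N_STATES, N_ACTIONS = 10, 4
ALPHA, GAMMA = 0.1, 0.95

rng = np.random.default_rng(0)
Q_A = np.zeros((N_STATES, N_ACTIONS))
Q_B = np.zeros((N_STATES, N_ACTIONS))


def double_q_update(s, a, r, s_next):
    """Update one randomly chosen table: the action is selected with that
    table's arg max but evaluated with the other table, as in the equations above."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))               # arg max under Q^A
        Q_A[s, a] += ALPHA * (r + GAMMA * Q_B[s_next, a_star] - Q_A[s, a])
    else:
        b_star = int(np.argmax(Q_B[s_next]))               # arg max under Q^B
        Q_B[s, a] += ALPHA * (r + GAMMA * Q_A[s_next, b_star] - Q_B[s, a])
</syntaxhighlight>

During interaction, actions can be chosen (for example, ε-greedily) from a combination of the two estimates, such as their average.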
=== Others ===
Delayed Q-learning is an alternative implementation of the online ''Q''-learning algorithm, with [[Probably approximately correct learning|probably approximately correct (PAC) learning]].<ref>{{Cite journal |last1=Strehl |first1=Alexander L. |last2=Li |first2=Lihong |last3=Wiewiora |first3=Eric |last4=Langford |first4=John |last5=Littman |first5=Michael L. |year=2006 |title=Pac model-free reinforcement learning |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/published-14.pdf |journal=Proc. 22nd ICML |pages=881–888}}</ref>

Greedy GQ is a variant of ''Q''-learning for use in combination with (linear) function approximation.<ref>{{cite web |first1=Hamid |last1=Maei |first2=Csaba |last2=Szepesvári |first3=Shalabh |last3=Bhatnagar |first4=Richard |last4=Sutton |url=https://webdocs.cs.ualberta.ca/~sutton/papers/MSBS-10.pdf |title=Toward off-policy learning control with function approximation |work=Proceedings of the 27th International Conference on Machine Learning |pages=719–726 |year=2010 |access-date=2016-01-25 |archive-url=https://web.archive.org/web/20120908050052/http://webdocs.cs.ualberta.ca/~sutton/papers/MSBS-10.pdf |archive-date=2012-09-08 |url-status=dead}}</ref> The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

Distributional Q-learning is a variant of ''Q''-learning which seeks to model the distribution of returns rather than the expected return of each action. It has been observed to facilitate estimation by deep neural networks and can enable alternative control methods, such as risk-sensitive control (see the illustrative sketch at the end of this section).<ref>{{cite journal |last1=Hessel |first1=Matteo |last2=Modayil |first2=Joseph |last3=van Hasselt |first3=Hado |last4=Schaul |first4=Tom |last5=Ostrovski |first5=Georg |last6=Dabney |first6=Will |last7=Horgan |first7=Dan |last8=Piot |first8=Bilal |last9=Azar |first9=Mohammad |last10=Silver |first10=David |title=Rainbow: Combining Improvements in Deep Reinforcement Learning |journal=Proceedings of the AAAI Conference on Artificial Intelligence |date=February 2018 |volume=32 |doi=10.1609/aaai.v32i1.11796 |arxiv=1710.02298 |s2cid=19135734}}</ref>

=== Multi-agent learning ===
Q-learning has been proposed in the multi-agent setting (see Section 4.1.2 in <ref>{{cite journal |last1=Shoham |first1=Yoav |last2=Powers |first2=Rob |last3=Grenager |first3=Trond |title=If multi-agent learning is the answer, what is the question? |journal=Artificial Intelligence |date=1 May 2007 |volume=171 |issue=7 |pages=365–377 |doi=10.1016/j.artint.2006.02.006 |url=https://dl.acm.org/doi/10.1016/j.artint.2006.02.006 |access-date=4 April 2023 |issn=0004-3702}}</ref>). One approach consists in pretending the environment is passive.<ref>{{cite journal |last1=Sen |first1=Sandip |last2=Sekaran |first2=Mahendra |last3=Hale |first3=John |title=Learning to coordinate without sharing information |journal=Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence |date=1 August 1994 |pages=426–431 |url=https://dl.acm.org/doi/10.5555/2891730.2891796 |access-date=4 April 2023 |publisher=AAAI Press}}</ref> Littman proposes the minimax Q-learning algorithm.<ref>{{cite journal |last1=Littman |first1=Michael L. |title=Markov games as a framework for multi-agent reinforcement learning |journal=Proceedings of the Eleventh International Conference on International Conference on Machine Learning |date=10 July 1994 |pages=157–163 |url=https://dl.acm.org/doi/10.5555/3091574.3091594 |access-date=4 April 2023 |publisher=Morgan Kaufmann Publishers Inc. |isbn=9781558603356}}</ref>
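The distributional idea mentioned under ''Others'' can be illustrated with a minimal sketch: instead of a single scalar <math>Q(s, a)</math>, the agent maintains a distribution over returns, from which both the usual expectation and risk-sensitive statistics can be read. The atom grid and probabilities below are placeholders, not learned values, and no learning algorithm is shown:

<syntaxhighlight lang="python">
import numpy as np

# Categorical return distribution for a single (state, action) pair.
# The atom grid and the probabilities are illustrative placeholders.
atoms = np.linspace(-10.0, 10.0, 51)            # support of the return distribution
probs = np.full(atoms.size, 1.0 / atoms.size)   # probabilities over the atoms (sum to 1)

# Ordinary Q-learning would keep only this expectation:
expected_return = float(np.dot(probs, atoms))

# A risk-sensitive controller can instead use other statistics of the same
# distribution, for example the probability of a return below some threshold:
risk_of_low_return = float(probs[atoms < -5.0].sum())
</syntaxhighlight>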