=== Double Q-learning ===
Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action selection policy, Q-learning can sometimes overestimate the action values in noisy environments, which slows learning. A variant called double Q-learning was proposed to correct this. Double Q-learning<ref>{{Cite journal |last=van Hasselt |first=Hado |year=2011 |title=Double Q-learning |url=http://papers.nips.cc/paper/3964-double-q-learning |format=PDF |journal=Advances in Neural Information Processing Systems |volume=23 |pages=2613–2622}}</ref> is an [[off-policy]] reinforcement learning algorithm, in which a different policy is used for value evaluation than the one used to select the next action.

In practice, two separate value functions <math>Q^A</math> and <math>Q^B</math> are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

:<math>Q^A_{t+1}(s_{t}, a_{t}) = Q^A_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^B_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_{t}(s_{t}, a_{t})\right)</math>, and

:<math>Q^B_{t+1}(s_{t}, a_{t}) = Q^B_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^A_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_{t}(s_{t}, a_{t})\right).</math>

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.

This algorithm was later modified in 2015 and combined with [[deep learning]],<ref>{{cite arXiv |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |title=Deep Reinforcement Learning with Double Q-learning |date=8 December 2015 |class=cs.LG |eprint=1509.06461 }}</ref> as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.<ref>{{Cite journal |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |date=2015 |title=Deep reinforcement learning with double Q-learning |url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847 |format=PDF |journal=AAAI Conference on Artificial Intelligence |pages=2094–2100 |arxiv=1509.06461 }}</ref>
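The update rule above can be illustrated with a minimal tabular sketch. This is not a reference implementation; the array names (<code>q_a</code>, <code>q_b</code>), the environment dimensions, and the hyperparameter values are assumptions chosen for illustration, and the update randomly picks which of the two tables to adjust, each using the other table to evaluate its own greedy action.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative tabular double Q-learning update (names and sizes are assumptions).
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)
q_a = np.zeros((n_states, n_actions))
q_b = np.zeros((n_states, n_actions))
rng = np.random.default_rng()

def double_q_update(s, a, r, s_next):
    """Apply one double Q-learning update for the transition (s, a, r, s_next)."""
    if rng.random() < 0.5:
        # Update Q^A: choose the greedy action with Q^A, evaluate it with Q^B.
        a_star = np.argmax(q_a[s_next])
        q_a[s, a] += alpha * (r + gamma * q_b[s_next, a_star] - q_a[s, a])
    else:
        # Update Q^B: choose the greedy action with Q^B, evaluate it with Q^A.
        b_star = np.argmax(q_b[s_next])
        q_b[s, a] += alpha * (r + gamma * q_a[s_next, b_star] - q_b[s, a])
</syntaxhighlight>

During interaction with the environment, actions would typically be selected from the combined estimate (for example, greedily with respect to <code>q_a + q_b</code>), so that each table is evaluated by a function trained on different experiences.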