=== Double Q-learning ===
Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in the current action selection policy, Q-learning can sometimes overestimate the action values in noisy environments, which slows learning. A variant called double Q-learning was proposed to correct this. Double Q-learning<ref>{{Cite journal |last=van Hasselt |first=Hado |year=2011 |title=Double Q-learning |url=http://papers.nips.cc/paper/3964-double-q-learning |format=PDF |journal=Advances in Neural Information Processing Systems |volume=23 |pages=2613–2622}}</ref> is an [[off-policy]] reinforcement learning algorithm, in which a different policy is used for value evaluation than the one used to select the next action.

In practice, two separate value functions <math>Q^A</math> and <math>Q^B</math> are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

:<math>Q^A_{t+1}(s_{t}, a_{t}) = Q^A_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^B_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^A_t(s_{t+1}, a)\right) - Q^A_{t}(s_{t}, a_{t})\right)</math>, and

:<math>Q^B_{t+1}(s_{t}, a_{t}) = Q^B_{t}(s_{t}, a_{t}) + \alpha_{t}(s_{t}, a_{t}) \left(r_{t} + \gamma Q^A_{t}\left(s_{t+1}, \mathop{\operatorname{arg\,max}}_{a} Q^B_t(s_{t+1}, a)\right) - Q^B_{t}(s_{t}, a_{t})\right).</math>

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.

This algorithm was later modified in 2015 and combined with [[deep learning]],<ref>{{cite arXiv |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |title=Deep Reinforcement Learning with Double Q-learning |date=8 December 2015 |class=cs.LG |eprint=1509.06461 }}</ref> as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.<ref>{{Cite journal |last1=van Hasselt |first1=Hado |last2=Guez |first2=Arthur |last3=Silver |first3=David |date=2015 |title=Deep reinforcement learning with double Q-learning |url=https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847 |format=PDF |journal=AAAI Conference on Artificial Intelligence |pages=2094–2100 |arxiv=1509.06461 }}</ref>
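The update rule above can be illustrated with a minimal tabular sketch. This is not a reference implementation; the array names (<code>q_a</code>, <code>q_b</code>), the environment dimensions, and the hyperparameter values are assumptions chosen for illustration, and the update randomly picks which of the two tables to adjust, each using the other table to evaluate its own greedy action.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative tabular double Q-learning update (names and sizes are assumptions).
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)
q_a = np.zeros((n_states, n_actions))
q_b = np.zeros((n_states, n_actions))
rng = np.random.default_rng()

def double_q_update(s, a, r, s_next):
    """Apply one double Q-learning update for the transition (s, a, r, s_next)."""
    if rng.random() < 0.5:
        # Update Q^A: choose the greedy action with Q^A, evaluate it with Q^B.
        a_star = np.argmax(q_a[s_next])
        q_a[s, a] += alpha * (r + gamma * q_b[s_next, a_star] - q_a[s, a])
    else:
        # Update Q^B: choose the greedy action with Q^B, evaluate it with Q^A.
        b_star = np.argmax(q_b[s_next])
        q_b[s, a] += alpha * (r + gamma * q_a[s_next, b_star] - q_b[s, a])
</syntaxhighlight>

During interaction with the environment, actions would typically be selected from the combined estimate (for example, greedily with respect to <code>q_a + q_b</code>), so that each table is evaluated by a function trained on different experiences.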