=== Discount factor ===
The discount factor {{tmath|\gamma}} determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. <math>r_t</math> (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For {{tmath|\gamma {{=}} 1}}, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.<ref>{{Cite book |title=Artificial Intelligence: A Modern Approach |last1=Russell |first1=Stuart J. |last2=Norvig |first2=Peter |date=2010 |publisher=[[Prentice Hall]] |isbn=978-0136042594 |edition=Third |page=649 |author-link=Stuart J. Russell |author-link2=Peter Norvig}}</ref> Even with a discount factor only slightly lower than 1, ''Q''-function learning leads to propagation of errors and instabilities when the value function is approximated with an [[artificial neural network]].<ref>{{cite journal|first=Leemon |last=Baird |title=Residual algorithms: Reinforcement learning with function approximation |url=http://www.leemon.com/papers/1995b.pdf |journal=ICML |pages= 30–37 |year=1995}}</ref> In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.<ref>{{cite arXiv|last1=François-Lavet|first1=Vincent|last2=Fonteneau|first2=Raphael|last3=Ernst|first3=Damien|date=2015-12-07|title=How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies|eprint=1512.02011 |class=cs.LG}}</ref>
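
As a rough illustrative sketch (not part of the article's text), the role of {{tmath|\gamma}} in the tabular ''Q''-learning update can be seen directly in code. All names below (<code>n_states</code>, <code>n_actions</code>, <code>alpha</code>, the environment shape) are hypothetical placeholders chosen for the example.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative tabular Q-learning update showing where the discount factor enters.
# n_states, n_actions, alpha, and gamma are assumed values for this sketch.
n_states, n_actions = 10, 4
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor: 0 -> myopic, values near 1 -> far-sighted
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])  # gamma weights the estimated future return
    Q[s, a] += alpha * (td_target - Q[s, a])

# With gamma = 0 the target collapses to the immediate reward r alone ("myopic"),
# while gamma close to 1 makes distant rewards nearly as influential as immediate ones.
q_update(s=0, a=1, r=1.0, s_next=3)
</syntaxhighlight>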