==Influence of variables==

=== Learning rate ===
The [[learning rate]] or ''step size'' determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully [[Deterministic system|deterministic]] environments, a learning rate of <math>\alpha_t = 1</math> is optimal. When the problem is [[Stochastic systems|stochastic]], the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as <math>\alpha_t = 0.1</math> for all <math>t</math>.<ref>{{Cite book |url=http://incompleteideas.net/sutton/book/ebook/the-book.html |title=Reinforcement Learning: An Introduction |last1=Sutton |first1=Richard |last2=Barto |first2=Andrew |date=1998 |publisher=MIT Press}}</ref>

=== Discount factor ===
The discount factor {{tmath|\gamma}} determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. <math>r_t</math> (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For {{tmath|\gamma {{=}} 1}}, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.<ref>{{Cite book |title=Artificial Intelligence: A Modern Approach |last1=Russell |first1=Stuart J. |last2=Norvig |first2=Peter |date=2010 |publisher=[[Prentice Hall]] |isbn=978-0136042594 |edition=Third |page=649 |author-link=Stuart J. Russell |author-link2=Peter Norvig}}</ref> Even with a discount factor only slightly lower than 1, ''Q''-function learning leads to propagation of errors and instabilities when the value function is approximated with an [[artificial neural network]].<ref>{{cite journal |first=Leemon |last=Baird |title=Residual algorithms: Reinforcement learning with function approximation |url=http://www.leemon.com/papers/1995b.pdf |journal=ICML |pages=30–37 |year=1995}}</ref> In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.<ref>{{cite arXiv |last1=François-Lavet |first1=Vincent |last2=Fonteneau |first2=Raphael |last3=Ernst |first3=Damien |date=2015-12-07 |title=How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies |eprint=1512.02011 |class=cs.LG}}</ref>

=== Initial conditions (''Q''<sub>0</sub>) ===
Since ''Q''-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",<ref>{{Cite book |chapter-url=http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |title=Reinforcement Learning: An Introduction |last1=Sutton |first1=Richard S. |last2=Barto |first2=Andrew G. |archive-url=https://web.archive.org/web/20130908031737/http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |archive-date=2013-09-08 |url-status=dead |access-date=2013-07-18 |chapter=2.7 Optimistic Initial Values}}</ref> can encourage exploration: no matter what action is selected, the update rule will cause it to have a lower value than the other alternatives, thus increasing their choice probability.
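As a minimal illustration only (the environment size, the parameter values and the variable names below are assumptions chosen for the example, not values taken from the cited sources), the following tabular sketch shows where the learning rate <math>\alpha</math>, the discount factor <math>\gamma</math> and the optimistic initial value ''Q''<sub>0</sub> enter the update:

<syntaxhighlight lang="python">
# Illustrative tabular Q-learning update; all numbers are example choices.
alpha = 0.1    # learning rate: weight given to newly acquired information
gamma = 0.9    # discount factor: weight given to future rewards
q_init = 1.0   # optimistic initial value, encouraging exploration

n_states, n_actions = 5, 2
Q = [[q_init] * n_actions for _ in range(n_states)]  # optimistically initialised Q-table

def update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s_next,a') - Q(s,a))."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# Example step: in state 0, action 1 yields reward 0 and leads to state 3.
update(0, 1, 0.0, 3)
</syntaxhighlight>

With <code>alpha = 1</code> the old value is discarded entirely, matching the fully deterministic case above; with <code>alpha = 0</code> the table never changes.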
The first reward <math>r</math> can be used to reset the initial conditions.<ref name="hshteingart">{{Cite journal |last1=Shteingart |first1=Hanan |last2=Neiman |first2=Tal |last3=Loewenstein |first3=Yonatan |date=May 2013 |title=The role of first impression in operant learning. |url=https://shteingart.wordpress.com/wp-content/uploads/2008/02/the-role-of-first-impression-in-operant-learning.pdf |journal=Journal of Experimental Psychology: General |language=en |volume=142 |issue=2 |pages=476–488 |doi=10.1037/a0029550 |issn=1939-2222 |pmid=22924882}}</ref> According to this idea, the first time an action is taken the reward is used to set the value of <math>Q</math>. This allows immediate learning in case of fixed deterministic rewards. A model that incorporates ''reset of initial conditions'' (RIC) is expected to predict participants' behavior better than a model that assumes any ''arbitrary initial condition'' (AIC).<ref name="hshteingart" /> RIC seems to be consistent with human behaviour in repeated binary choice experiments.<ref name="hshteingart" />
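Read as a sketch (the dictionary-based bookkeeping below, and the use of the raw reward alone on the first visit, are assumptions made for illustration), reset of initial conditions amounts to replacing the arbitrary initial value with the first observed reward:

<syntaxhighlight lang="python">
# Illustrative sketch of reset of initial conditions (RIC); details are assumptions.
alpha, gamma = 0.1, 0.9
Q = {}  # Q[(state, action)] -> value, created lazily on first visit

def update_ric(s, a, r, s_next, actions):
    if (s, a) not in Q:
        Q[(s, a)] = r  # first visit: the observed reward resets the initial condition
    else:
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
</syntaxhighlight>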