=== Initial conditions (''Q''<sub>0</sub>) ===
Since ''Q''-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",<ref>{{Cite book |chapter-url=http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |title=Reinforcement Learning: An Introduction |last1=Sutton |first1=Richard S. |last2=Barto |first2=Andrew G. |archive-url=https://web.archive.org/web/20130908031737/http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |archive-date=2013-09-08 |url-status=dead |access-date=2013-07-18 |chapter=2.7 Optimistic Initial Values}}</ref> can encourage exploration: no matter which action is selected, the update rule lowers its value relative to the alternatives, which increases the probability that those alternatives are chosen next. The first reward <math>r</math> can be used to reset the initial conditions.<ref name="hshteingart">{{Cite journal |last1=Shteingart |first1=Hanan |last2=Neiman |first2=Tal |last3=Loewenstein |first3=Yonatan |date=May 2013 |title=The role of first impression in operant learning. |url=https://shteingart.wordpress.com/wp-content/uploads/2008/02/the-role-of-first-impression-in-operant-learning.pdf |journal=Journal of Experimental Psychology: General |language=en |volume=142 |issue=2 |pages=476–488 |doi=10.1037/a0029550 |issn=1939-2222 |pmid=22924882}}</ref> Under this scheme, the first time an action is taken, its reward is used to set the value of <math>Q</math>, which allows immediate learning in the case of fixed, deterministic rewards. A model that incorporates ''reset of initial conditions'' (RIC) is expected to predict participants' behavior better than a model that assumes an ''arbitrary initial condition'' (AIC).<ref name="hshteingart" /> RIC appears to be consistent with human behaviour in repeated binary choice experiments.<ref name="hshteingart" />
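
A minimal sketch of the two initialization schemes, assuming a tabular ''Q''-learning agent on a toy two-armed bandit with deterministic rewards (the environment, parameter values, and names below are illustrative assumptions, not taken from the cited sources):

<syntaxhighlight lang="python">
import random

ACTIONS = [0, 1]
ALPHA = 0.1          # learning rate
OPTIMISTIC_Q0 = 5.0  # high ("optimistic") initial value, encourages exploration

def make_q(initial):
    # table of Q-values, one entry per action
    return {a: initial for a in ACTIONS}

def reward(action):
    # toy deterministic rewards for the two actions
    return 1.0 if action == 0 else 0.5

def run(q, use_ric, steps=100):
    visited = set()
    for _ in range(steps):
        # greedy selection, ties broken at random
        best = max(q.values())
        action = random.choice([a for a in ACTIONS if q[a] == best])
        r = reward(action)
        if use_ric and action not in visited:
            # reset of initial conditions: the first reward overwrites
            # the arbitrary initial value of Q for this action
            q[action] = r
            visited.add(action)
        else:
            # standard one-step update (no next state in a bandit)
            q[action] += ALPHA * (r - q[action])
    return q

print(run(make_q(OPTIMISTIC_Q0), use_ric=False))  # optimistic initial values
print(run(make_q(0.0), use_ric=True))             # reset of initial conditions
</syntaxhighlight>

With optimistic values, greedy selection alone drives the agent to try every action at least once, since each update pulls the chosen action's value below the still-inflated alternatives; with RIC, the arbitrary starting value is discarded as soon as the true reward is observed, giving immediate learning for fixed deterministic rewards.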