=== Initial conditions (''Q''<sub>0</sub>) ===
Since ''Q''-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",<ref>{{Cite book |chapter-url=http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |title=Reinforcement Learning: An Introduction |last1=Sutton |first1=Richard S. |last2=Barto |first2=Andrew G. |archive-url=https://web.archive.org/web/20130908031737/http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html |archive-date=2013-09-08 |url-status=dead |access-date=2013-07-18 |chapter=2.7 Optimistic Initial Values}}</ref> can encourage exploration: no matter which action is selected, the update rule lowers its value relative to the alternatives, which increases the probability that those alternatives are chosen next. The first reward <math>r</math> can be used to reset the initial conditions.<ref name="hshteingart">{{Cite journal |last1=Shteingart |first1=Hanan |last2=Neiman |first2=Tal |last3=Loewenstein |first3=Yonatan |date=May 2013 |title=The role of first impression in operant learning. |url=https://shteingart.wordpress.com/wp-content/uploads/2008/02/the-role-of-first-impression-in-operant-learning.pdf |journal=Journal of Experimental Psychology: General |language=en |volume=142 |issue=2 |pages=476–488 |doi=10.1037/a0029550 |issn=1939-2222 |pmid=22924882}}</ref> Under this scheme, the first time an action is taken, its reward is used to set the value of <math>Q</math>, which allows immediate learning in the case of fixed, deterministic rewards. A model that incorporates ''reset of initial conditions'' (RIC) is expected to predict participants' behavior better than a model that assumes an ''arbitrary initial condition'' (AIC).<ref name="hshteingart" /> RIC appears to be consistent with human behaviour in repeated binary choice experiments.<ref name="hshteingart" />
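
A minimal sketch of the two initialization schemes, assuming a tabular ''Q''-learning agent on a toy two-armed bandit with deterministic rewards (the environment, parameter values, and names below are illustrative assumptions, not taken from the cited sources):

<syntaxhighlight lang="python">
import random

ACTIONS = [0, 1]
ALPHA = 0.1          # learning rate
OPTIMISTIC_Q0 = 5.0  # high ("optimistic") initial value, encourages exploration

def make_q(initial):
    # table of Q-values, one entry per action
    return {a: initial for a in ACTIONS}

def reward(action):
    # toy deterministic rewards for the two actions
    return 1.0 if action == 0 else 0.5

def run(q, use_ric, steps=100):
    visited = set()
    for _ in range(steps):
        # greedy selection, ties broken at random
        best = max(q.values())
        action = random.choice([a for a in ACTIONS if q[a] == best])
        r = reward(action)
        if use_ric and action not in visited:
            # reset of initial conditions: the first reward overwrites
            # the arbitrary initial value of Q for this action
            q[action] = r
            visited.add(action)
        else:
            # standard one-step update (no next state in a bandit)
            q[action] += ALPHA * (r - q[action])
    return q

print(run(make_q(OPTIMISTIC_Q0), use_ric=False))  # optimistic initial values
print(run(make_q(0.0), use_ric=True))             # reset of initial conditions
</syntaxhighlight>

With optimistic values, greedy selection alone drives the agent to try every action at least once, since each update pulls the chosen action's value below the still-inflated alternatives; with RIC, the arbitrary starting value is discarded as soon as the true reward is observed, giving immediate learning for fixed deterministic rewards.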