{{Short description|Model-free reinforcement learning algorithm}}
{{Machine learning|Reinforcement learning}}
'''''Q''-learning''' is a [[reinforcement learning]] algorithm that trains an [[Intelligent agent|agent]] to assign values to its possible actions based on its current [[State (computer science)|state]], without requiring a model of the environment ([[Model-free (reinforcement learning)|model-free]]). It can handle problems with [[Stochastic matrix|stochastic transitions]] and rewards without requiring adaptations.<ref name="Li-2023">{{cite book |last1=Li |first1=Shengbo |title=Reinforcement Learning for Sequential Decision and Optimal Control |date=2023 |location=Springer Verlag, Singapore |isbn=978-9-811-97783-1 |pages=1–460 |doi=10.1007/978-981-19-7784-8 |s2cid=257928563 |edition=First |url=https://link.springer.com/book/10.1007/978-981-19-7784-8}}</ref> For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, ''Q''-learning might assign a higher value to moving right than to moving left if going right reaches the exit faster, refining this choice by trying both directions over time.

For any finite [[Markov decision process]], ''Q''-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state.<ref name="auto">{{Cite web |last=Melo |first=Francisco S. |title=Convergence of Q-learning: a simple proof |url=http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf}}</ref> ''Q''-learning can identify an optimal [[action selection|action-selection]] policy for any given finite Markov decision process, given infinite exploration time and a partly random policy.<ref name="auto" /> "Q" refers to the function that the algorithm computes: the expected reward—that is, the ''quality''—of an action taken in a given state.<ref name=":0">{{Cite web |url=http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ |title=Demystifying Deep Reinforcement Learning |last=Matiisen |first=Tambet |date=December 19, 2015 |website=neuro.cs.ut.ee |publisher=Computational Neuroscience Lab |language=en-US |access-date=2018-04-06}}</ref>
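The following is a minimal sketch of tabular ''Q''-learning applied to a toy, one-dimensional version of the grid-maze example above. The corridor environment, the exit reward of 10 points, and all hyperparameters (learning rate, discount factor, exploration rate) are illustrative assumptions chosen for this sketch, not values prescribed by the algorithm itself.

<syntaxhighlight lang="python">
# Minimal tabular Q-learning sketch on a toy corridor maze.
# The environment and hyperparameters below are illustrative assumptions.
import random

N_STATES = 5                 # states 0..4 along a corridor
EXIT_STATE = 4               # reaching the exit is worth 10 points
ACTIONS = [-1, +1]           # move left or move right
ALPHA = 0.5                  # learning rate
GAMMA = 0.9                  # discount factor
EPSILON = 0.1                # exploration rate (partly random policy)

# Q-table: Q[state][action_index] = estimated quality of that action in that state.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(state, action):
    """Apply an action; reaching the exit gives reward 10 and ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == EXIT_STATE:
        return next_state, 10.0, True
    return next_state, 0.0, False

for episode in range(500):
    state = 0
    done = False
    while not done:
        # Epsilon-greedy selection: mostly take the best-known action, sometimes explore.
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
        next_state, reward, done = step(state, ACTIONS[a])
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        best_next = max(Q[next_state])
        Q[state][a] += ALPHA * (reward + GAMMA * best_next - Q[state][a])
        state = next_state

# After training, moving right (toward the exit) is valued higher than moving left.
print(Q)
</syntaxhighlight>

In this sketch the learned values for "move right" exceed those for "move left" at every state, reflecting how repeated exploration lets the agent improve its choice at each junction over time.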