== Algorithm ==
[[File:Q-Learning Matrix Initialized and After Training.png|thumb|upright=2|A Q-learning table mapping states to actions, initially filled with zeros and updated iteratively through training]]

After <math>\Delta t</math> steps into the future the agent will decide some next step. The weight for this step is calculated as <math>\gamma^{\Delta t}</math>, where <math>\gamma</math> (the ''discount factor'') is a number between 0 and 1 (<math>0 \le \gamma \le 1</math>). Assuming <math>\gamma < 1</math>, it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). <math>\gamma</math> may also be interpreted as the probability to succeed (or survive) at every step <math>\Delta t</math>.

The algorithm, therefore, has a function that calculates the quality of a state–action combination:

:<math>Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math>.

Before learning begins, {{tmath|Q}} is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time <math>t</math> the agent selects an action <math>A_t</math>, observes a reward <math>R_{t+1}</math>, enters a new state <math>S_{t+1}</math> (that may depend on both the previous state <math>S_t</math> and the selected action), and <math>Q</math> is updated. The core of the algorithm is a [[Bellman equation]] as a simple [[Markov decision process#Value iteration|value iteration update]], using the weighted average of the current value and the new information:<ref>{{cite arXiv |last1=Dietterich |first1=Thomas G. |title=Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition |date=21 May 1999 |eprint=cs/9905014 }}</ref>

:<math>Q^{new}(S_{t},A_{t}) \leftarrow (1 - \underbrace{\alpha}_{\text{learning rate}}) \cdot \underbrace{Q(S_{t},A_{t})}_{\text{current value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \bigg( \underbrace{\underbrace{R_{t+1}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(S_{t+1}, a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} \bigg)</math>

where <math>R_{t+1}</math> is the reward received when moving from the state <math>S_{t}</math> to the state <math>S_{t+1}</math>, and <math>\alpha</math> is the [[learning rate]] <math>(0 < \alpha \le 1)</math>.

Note that <math>Q^{new}(S_t,A_t)</math> is the sum of three factors:
* <math>(1 - \alpha)Q(S_t,A_t)</math>: the current value (weighted by one minus the learning rate)
* <math>\alpha \, R_{t+1}</math>: the reward <math>R_{t+1}</math> to obtain if action <math>A_t</math> is taken when in state <math>S_t</math> (weighted by learning rate)
* <math>\alpha \gamma \max_{a}Q(S_{t+1},a)</math>: the maximum reward that can be obtained from state <math>S_{t+1}</math> (weighted by learning rate and discount factor)

An episode of the algorithm ends when state <math>S_{t+1}</math> is a final or ''terminal state''. However, ''Q''-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states <math>s_f</math>, <math>Q(s_f, a)</math> is never updated, but is set to the reward value <math>r</math> observed for state <math>s_f</math>. In most cases, <math>Q(s_f,a)</math> can be taken to equal zero.
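The update rule above can be illustrated with a short tabular sketch. The Python code below is only an illustration, not part of the algorithm's definition: the environment interface (<code>reset()</code> returning a state, <code>step(action)</code> returning a next state, reward, and done flag) and the ε-greedy exploration rule are assumptions made for the example; only the Q-table update line mirrors the formula above.

<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q initialized to an arbitrary fixed value chosen by the programmer (here zero).
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()          # assumed interface: reset() -> initial state index
        done = False
        while not done:
            # epsilon-greedy action selection (one possible exploration scheme,
            # not prescribed by the update rule itself)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            # assumed interface: step(a) -> (next state index, reward, episode finished)
            s_next, r, done = env.step(a)
            # temporal-difference target: reward plus discounted estimate of
            # the optimal future value from the next state
            target = r + gamma * np.max(Q[s_next])
            # weighted average of the current value and the new information
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
</syntaxhighlight>

Because the table is initialized to zero and terminal states are never updated, <code>np.max(Q[s_next])</code> evaluates to zero when <code>s_next</code> is terminal, matching the convention that <math>Q(s_f,a)</math> can usually be taken to equal zero.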