== Algorithm ==
[[File:Q-Learning Matrix Initialized and After Training.png|thumb|upright=2|A Q-learning table mapping states to actions, initially filled with zeros and updated iteratively through training]]

After <math>\Delta t</math> steps into the future the agent will decide some next step. The weight for this step is calculated as <math>\gamma^{\Delta t}</math>, where <math>\gamma</math> (the ''discount factor'') is a number between 0 and 1 (<math>0 \le \gamma \le 1</math>). Assuming <math>\gamma < 1</math>, it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). <math>\gamma</math> may also be interpreted as the probability to succeed (or survive) at every step <math>\Delta t</math>.

The algorithm, therefore, has a function that calculates the quality of a state–action combination:

:<math>Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math>.

Before learning begins, {{tmath|Q}} is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time <math>t</math> the agent selects an action <math>A_t</math>, observes a reward <math>R_{t+1}</math>, enters a new state <math>S_{t+1}</math> (that may depend on both the previous state <math>S_t</math> and the selected action), and <math>Q</math> is updated. The core of the algorithm is a [[Bellman equation]] as a simple [[Markov decision process#Value iteration|value iteration update]], using the weighted average of the current value and the new information:<ref>{{cite arXiv |last1=Dietterich |first1=Thomas G. |title=Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition |date=21 May 1999 |eprint=cs/9905014 }}</ref>

:<math>Q^{new}(S_{t},A_{t}) \leftarrow (1 - \underbrace{\alpha}_{\text{learning rate}}) \cdot \underbrace{Q(S_{t},A_{t})}_{\text{current value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \bigg( \underbrace{\underbrace{R_{t+1}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(S_{t+1}, a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} \bigg)</math>

where <math>R_{t+1}</math> is the reward received when moving from the state <math>S_{t}</math> to the state <math>S_{t+1}</math>, and <math>\alpha</math> is the [[learning rate]] <math>(0 < \alpha \le 1)</math>.

Note that <math>Q^{new}(S_t,A_t)</math> is the sum of three factors:
* <math>(1 - \alpha)Q(S_t,A_t)</math>: the current value (weighted by one minus the learning rate)
* <math>\alpha \, R_{t+1}</math>: the reward <math>R_{t+1}</math> to obtain if action <math>A_t</math> is taken when in state <math>S_t</math> (weighted by learning rate)
* <math>\alpha \gamma \max_{a}Q(S_{t+1},a)</math>: the maximum reward that can be obtained from state <math>S_{t+1}</math> (weighted by learning rate and discount factor)

An episode of the algorithm ends when state <math>S_{t+1}</math> is a final or ''terminal state''. However, ''Q''-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states <math>s_f</math>, <math>Q(s_f, a)</math> is never updated, but is set to the reward value <math>r</math> observed for state <math>s_f</math>. In most cases, <math>Q(s_f,a)</math> can be taken to equal zero.
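The update rule above can be illustrated with a short tabular sketch. The Python code below is only an illustration, not part of the algorithm's definition: the environment interface (<code>reset()</code> returning a state, <code>step(action)</code> returning a next state, reward, and done flag) and the ε-greedy exploration rule are assumptions made for the example; only the Q-table update line mirrors the formula above.

<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q initialized to an arbitrary fixed value chosen by the programmer (here zero).
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()          # assumed interface: reset() -> initial state index
        done = False
        while not done:
            # epsilon-greedy action selection (one possible exploration scheme,
            # not prescribed by the update rule itself)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            # assumed interface: step(a) -> (next state index, reward, episode finished)
            s_next, r, done = env.step(a)
            # temporal-difference target: reward plus discounted estimate of
            # the optimal future value from the next state
            target = r + gamma * np.max(Q[s_next])
            # weighted average of the current value and the new information
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
</syntaxhighlight>

Because the table is initialized to zero and terminal states are never updated, <code>np.max(Q[s_next])</code> evaluates to zero when <code>s_next</code> is terminal, matching the convention that <math>Q(s_f,a)</math> can usually be taken to equal zero.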