== Mathematical formulation ==

The tabular TD(0) method is one of the simplest TD methods. It is a special case of more general stochastic approximation methods. It estimates the [[Reinforcement_learning#Algorithms_for_control_learning|state value function]] of a finite-state [[Markov decision process]] (MDP) under a policy <math>\pi</math>. Let <math>V^\pi</math> denote the state value function of the MDP with states <math>(S_t)_{t\in\mathbb{N}}</math>, rewards <math>(R_t)_{t\in\mathbb{N}}</math> and discount rate<ref>The discount rate parameter allows for a [[time preference]] toward more immediate rewards and away from distant future rewards.</ref> <math>\gamma</math> under the policy <math>\pi</math>:{{sfnp|Sutton|Barto|2018|p=134}}

:<math>V^\pi(s) = E_{a \sim \pi}\left\{\sum_{t=0}^\infty \gamma^t R_{t+1} \Bigg| S_0=s\right\}.</math>

We drop the action from the notation for convenience. <math>V^\pi</math> satisfies the [[Hamilton-Jacobi-Bellman equation|Hamilton-Jacobi-Bellman equation]]:

:<math>V^\pi(s) = E_{\pi}\{R_1 + \gamma V^\pi(S_1) \mid S_0=s\},</math>

so <math>R_1 + \gamma V^\pi(S_1)</math> is an unbiased estimate of <math>V^\pi(s)</math>. This observation motivates the following algorithm for estimating <math>V^\pi</math>.

The algorithm starts by initializing a table <math>V(s)</math> arbitrarily, with one value for each state of the MDP, and choosing a positive [[learning rate]] <math>\alpha</math>. We then repeatedly evaluate the policy <math>\pi</math>, obtain a reward <math>R_{t+1}</math>, and update the value function for the current state using the rule:{{sfnp|Sutton|Barto|2018|p=135}}

:<math>V(S_t) \leftarrow (1 - \alpha) V(S_t) + \underbrace{\alpha}_{\text{learning rate}} [ \overbrace{R_{t+1} + \gamma V(S_{t+1})}^{\text{the TD target}} ],</math>

where <math>S_t</math> and <math>S_{t+1}</math> are the current and next states, respectively. The value <math>R_{t+1} + \gamma V(S_{t+1})</math> is known as the TD target, and <math>R_{t+1} + \gamma V(S_{t+1}) - V(S_t)</math> is known as the TD error.
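The update rule above can be illustrated with a minimal Python sketch of tabular TD(0) policy evaluation. The <code>env.reset()</code>/<code>env.step()</code> interface, the <code>policy</code> callable, and the parameter defaults are illustrative assumptions, not part of the cited formulation.

<syntaxhighlight lang="python">
from collections import defaultdict

def td0_evaluate(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    """Tabular TD(0) policy evaluation (sketch).

    env    -- hypothetical environment with reset() -> state and
              step(action) -> (next_state, reward, done)
    policy -- maps a state to an action, i.e. the fixed policy pi
    """
    V = defaultdict(float)  # table V(s); here arbitrarily initialized to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); only R_{t+1} at terminal states
            target = reward + (0.0 if done else gamma * V[next_state])
            # TD error = target - V(S_t); update V(S_t) toward the target by alpha
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
</syntaxhighlight>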