== Mathematical formulation ==

The tabular TD(0) method is one of the simplest TD methods. It is a special case of more general stochastic approximation methods. It estimates the [[Reinforcement_learning#Algorithms_for_control_learning|state value function]] of a finite-state [[Markov decision process]] (MDP) under a policy <math>\pi</math>. Let <math>V^\pi</math> denote the state value function of the MDP with states <math>(S_t)_{t\in\mathbb{N}}</math>, rewards <math>(R_t)_{t\in\mathbb{N}}</math> and discount rate<ref>The discount rate parameter allows for a [[time preference]] toward more immediate rewards and away from distant future rewards.</ref> <math>\gamma</math> under the policy <math>\pi</math>:{{sfnp|Sutton|Barto|2018|p=134}}

:<math>V^\pi(s) = E_{a \sim \pi}\left\{\sum_{t=0}^\infty \gamma^t R_{t+1} \Bigg| S_0=s\right\}.</math>

We drop the action from the notation for convenience. <math>V^\pi</math> satisfies the [[Hamilton-Jacobi-Bellman equation|Hamilton-Jacobi-Bellman equation]]:

:<math>V^\pi(s) = E_{\pi}\{R_1 + \gamma V^\pi(S_1) \mid S_0=s\},</math>

so <math>R_1 + \gamma V^\pi(S_1)</math> is an unbiased estimate of <math>V^\pi(s)</math>. This observation motivates the following algorithm for estimating <math>V^\pi</math>.

The algorithm starts by initializing a table <math>V(s)</math> arbitrarily, with one value for each state of the MDP, and choosing a positive [[learning rate]] <math>\alpha</math>. We then repeatedly evaluate the policy <math>\pi</math>, obtain a reward <math>R_{t+1}</math>, and update the value function for the current state using the rule:{{sfnp|Sutton|Barto|2018|p=135}}

:<math>V(S_t) \leftarrow (1 - \alpha) V(S_t) + \underbrace{\alpha}_{\text{learning rate}} [ \overbrace{R_{t+1} + \gamma V(S_{t+1})}^{\text{the TD target}} ],</math>

where <math>S_t</math> and <math>S_{t+1}</math> are the current and next states, respectively. The value <math>R_{t+1} + \gamma V(S_{t+1})</math> is known as the TD target, and <math>R_{t+1} + \gamma V(S_{t+1}) - V(S_t)</math> is known as the TD error.
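The update rule above can be illustrated with a minimal Python sketch of tabular TD(0) policy evaluation. The <code>env.reset()</code>/<code>env.step()</code> interface, the <code>policy</code> callable, and the parameter defaults are illustrative assumptions, not part of the cited formulation.

<syntaxhighlight lang="python">
from collections import defaultdict

def td0_evaluate(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    """Tabular TD(0) policy evaluation (sketch).

    env    -- hypothetical environment with reset() -> state and
              step(action) -> (next_state, reward, done)
    policy -- maps a state to an action, i.e. the fixed policy pi
    """
    V = defaultdict(float)  # table V(s); here arbitrarily initialized to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); only R_{t+1} at terminal states
            target = reward + (0.0 if done else gamma * V[next_state])
            # TD error = target - V(S_t); update V(S_t) toward the target by alpha
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
</syntaxhighlight>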