=== Criterion of optimality ===

==== Policy ====

The agent's action selection is modeled as a map called ''policy'':

:<math>\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0,1]</math>
:<math>\pi(a,s) = \Pr(A_t = a \mid S_t = s)</math>

The policy map gives the probability of taking action <math>a</math> when in state <math>s</math>.<ref name=":0">{{Cite web|url=http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf|title=Reinforcement learning: An introduction|access-date=2017-07-23|archive-date=2017-07-12|archive-url=https://web.archive.org/web/20170712170739/http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf|url-status=dead}}</ref>{{Rp|61}} There are also deterministic policies <math>\pi</math>, for which <math>\pi(s)</math> denotes the action that should be played in state <math>s</math>.

==== State-value function ====

The state-value function <math>V_\pi(s)</math> is defined as the ''expected discounted return'' starting from state <math>s</math>, i.e. <math>S_0 = s</math>, and successively following policy <math>\pi</math>. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.<ref name=":0" />{{Rp|60}}

:<math>V_\pi(s) = \operatorname \mathbb{E}[G\mid S_0 = s] = \operatorname \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R_{t+1}\mid S_0 = s\right],</math>

where the random variable <math>G</math> denotes the '''discounted return''', defined as the sum of future discounted rewards:

:<math>G=\sum_{t=0}^\infty \gamma^t R_{t+1}=R_1 + \gamma R_2 + \gamma^2 R_3 + \dots,</math>

where <math>R_{t+1}</math> is the reward for transitioning from state <math>S_t</math> to <math>S_{t+1}</math>, and <math>0 \le \gamma<1</math> is the [[Q-learning#Discount factor|discount rate]]. Since <math>\gamma</math> is less than 1, rewards in the distant future are weighted less than rewards in the immediate future.

The algorithm must find a policy with maximum expected discounted return. From the theory of Markov decision processes it is known that, without loss of generality, the search can be restricted to the set of so-called ''stationary'' policies. A policy is ''stationary'' if the action distribution it returns depends only on the last state visited (from the observing agent's history). The search can be further restricted to ''deterministic'' stationary policies. A ''deterministic stationary'' policy deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.
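The following is a minimal sketch, not drawn from the cited text, of how the discounted return <math>G</math> and a Monte Carlo estimate of the state-value function <math>V_\pi(s)</math> might be computed for a stochastic policy. The toy MDP (its states, actions, transition function <code>step</code>, and the helper names such as <code>estimate_value</code>) is entirely hypothetical and serves only to illustrate the definitions above.

<syntaxhighlight lang="python">
import random

# Hypothetical toy MDP used only to illustrate the definitions above.
GAMMA = 0.9          # discount rate, 0 <= gamma < 1
POLICY = {           # pi(a, s) = Pr(A_t = a | S_t = s)
    "s0": {"a0": 0.5, "a1": 0.5},
    "s1": {"a0": 0.1, "a1": 0.9},
}

def step(state, action):
    """Sample (next_state, reward) from a hypothetical transition model."""
    if action == "a0":
        return "s0", 0.0
    return "s1", 1.0   # action "a1" moves toward s1 and yields a reward

def sample_action(state):
    """Draw an action according to the policy's distribution at `state`."""
    actions, probs = zip(*POLICY[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def discounted_return(rewards):
    """G = R_1 + gamma*R_2 + gamma^2*R_3 + ..."""
    return sum(GAMMA ** t, * (r,)[0] if False else GAMMA ** t * r for t, r in enumerate(rewards))

def estimate_value(start_state, episodes=1000, horizon=50):
    """Monte Carlo estimate of V_pi(start_state): the average discounted
    return over episodes starting in `start_state` and following the policy."""
    total = 0.0
    for _ in range(episodes):
        state, rewards = start_state, []
        for _ in range(horizon):   # truncate the infinite sum at a finite horizon
            action = sample_action(state)
            state, reward = step(state, action)
            rewards.append(reward)
        total += discounted_return(rewards)
    return total / episodes

print(estimate_value("s0"))   # approximate V_pi(s0)
</syntaxhighlight>

Truncating the infinite sum at a finite horizon introduces an error of at most <math>\gamma^{H} R_{\max}/(1-\gamma)</math> for horizon <math>H</math> and bounded rewards, which is why a discount rate strictly less than 1 makes such estimates tractable.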