=== Criterion of optimality ===

==== Policy ====

The agent's action selection is modeled as a map called ''policy'':

:<math>\pi: \mathcal{A} \times \mathcal{S} \rightarrow [0,1]</math>
:<math>\pi(a,s) = \Pr(A_t = a \mid S_t = s)</math>

The policy map gives the probability of taking action <math>a</math> when in state <math>s</math>.<ref name=":0">{{Cite web|url=http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf|title=Reinforcement learning: An introduction|access-date=2017-07-23|archive-date=2017-07-12|archive-url=https://web.archive.org/web/20170712170739/http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf|url-status=dead}}</ref>{{Rp|61}} There are also deterministic policies <math>\pi</math>, for which <math>\pi(s)</math> denotes the action that should be played in state <math>s</math>.

==== State-value function ====

The state-value function <math>V_\pi(s)</math> is defined as the ''expected discounted return'' starting from state <math>s</math>, i.e. <math>S_0 = s</math>, and successively following policy <math>\pi</math>. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.<ref name=":0" />{{Rp|60}}

:<math>V_\pi(s) = \operatorname \mathbb{E}[G\mid S_0 = s] = \operatorname \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R_{t+1}\mid S_0 = s\right],</math>

where the random variable <math>G</math> denotes the '''discounted return''', defined as the sum of future discounted rewards:

:<math>G=\sum_{t=0}^\infty \gamma^t R_{t+1}=R_1 + \gamma R_2 + \gamma^2 R_3 + \dots,</math>

where <math>R_{t+1}</math> is the reward for transitioning from state <math>S_t</math> to <math>S_{t+1}</math>, and <math>0 \le \gamma<1</math> is the [[Q-learning#Discount factor|discount rate]]. Since <math>\gamma</math> is less than 1, rewards in the distant future are weighted less than rewards in the immediate future.

The algorithm must find a policy with maximum expected discounted return. From the theory of Markov decision processes it is known that, without loss of generality, the search can be restricted to the set of so-called ''stationary'' policies. A policy is ''stationary'' if the action distribution it returns depends only on the last state visited (from the observing agent's history). The search can be further restricted to ''deterministic'' stationary policies. A ''deterministic stationary'' policy deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.
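The following is a minimal sketch, not drawn from the cited text, of how the discounted return <math>G</math> and a Monte Carlo estimate of the state-value function <math>V_\pi(s)</math> might be computed for a stochastic policy. The toy MDP (its states, actions, transition function <code>step</code>, and the helper names such as <code>estimate_value</code>) is entirely hypothetical and serves only to illustrate the definitions above.

<syntaxhighlight lang="python">
import random

# Hypothetical toy MDP used only to illustrate the definitions above.
GAMMA = 0.9          # discount rate, 0 <= gamma < 1
POLICY = {           # pi(a, s) = Pr(A_t = a | S_t = s)
    "s0": {"a0": 0.5, "a1": 0.5},
    "s1": {"a0": 0.1, "a1": 0.9},
}

def step(state, action):
    """Sample (next_state, reward) from a hypothetical transition model."""
    if action == "a0":
        return "s0", 0.0
    return "s1", 1.0   # action "a1" moves toward s1 and yields a reward

def sample_action(state):
    """Draw an action according to the policy's distribution at `state`."""
    actions, probs = zip(*POLICY[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def discounted_return(rewards):
    """G = R_1 + gamma*R_2 + gamma^2*R_3 + ..."""
    return sum(GAMMA ** t, * (r,)[0] if False else GAMMA ** t * r for t, r in enumerate(rewards))

def estimate_value(start_state, episodes=1000, horizon=50):
    """Monte Carlo estimate of V_pi(start_state): the average discounted
    return over episodes starting in `start_state` and following the policy."""
    total = 0.0
    for _ in range(episodes):
        state, rewards = start_state, []
        for _ in range(horizon):   # truncate the infinite sum at a finite horizon
            action = sample_action(state)
            state, reward = step(state, action)
            rewards.append(reward)
        total += discounted_return(rewards)
    return total / episodes

print(estimate_value("s0"))   # approximate V_pi(s0)
</syntaxhighlight>

Truncating the infinite sum at a finite horizon introduces an error of at most <math>\gamma^{H} R_{\max}/(1-\gamma)</math> for horizon <math>H</math> and bounded rewards, which is why a discount rate strictly less than 1 makes such estimates tractable.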