===Optimization objective===
The goal in a Markov decision process is to find a good "policy" for the decision maker: a function <math>\pi</math> that specifies the action <math>\pi(s)</math> that the decision maker will choose when in state <math>s</math>. Once a Markov decision process is combined with a policy in this way, the action for each state is fixed and the resulting combination behaves like a [[Markov chain]] (since the action chosen in state <math>s</math> is completely determined by <math>\pi(s)</math>).

The objective is to choose a policy <math>\pi</math> that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

:<math>E\left[\sum^{\infty}_{t=0} {\gamma^t R_{a_t} (s_t, s_{t+1})}\right]</math>

where the actions are given by the policy, <math>a_t = \pi(s_t)</math>, and the expectation is taken over <math>s_{t+1} \sim P_{a_t}(s_t,s_{t+1})</math>. Here <math>\gamma</math> is the discount factor satisfying <math>0 \le \gamma \le 1</math>, which is usually close to <math>1</math> (for example, <math>\gamma = 1/(1+r)</math> for some discount rate <math>r</math>). A lower discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely.

Another possible, but closely related, objective is the <math>H</math>-step return. Instead of using a discount factor <math>\gamma</math>, the agent is interested only in the first <math>H</math> steps of the process, with each reward given the same weight:

:<math>E\left[\sum^{H-1}_{t=0} {R_{a_t} (s_t, s_{t+1})}\right]</math>

where again <math>a_t = \pi(s_t)</math>, the expectation is taken over <math>s_{t+1} \sim P_{a_t}(s_t,s_{t+1})</math>, and <math>H</math> is the time horizon. Compared to the previous objective, the latter is more commonly used in [[learning theory]].

A policy that maximizes the function above is called an ''<dfn>optimal policy</dfn>'' and is usually denoted <math>\pi^*</math>. A particular MDP may have multiple distinct optimal policies. Because of the [[Markov property]], it can be shown that the optimal policy is a function of the current state, as assumed above.
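Both objectives can be estimated by simulating trajectories under a fixed policy and averaging the sampled returns. The following is a minimal sketch, not part of the formal definition: the arrays <code>P</code> and <code>R</code>, the policy <code>pi</code>, and the chosen sizes and parameters are illustrative assumptions, with <code>P[a, s, s']</code> standing for <math>P_a(s, s')</math> and <code>R[a, s, s']</code> for <math>R_a(s, s')</math>.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative finite MDP (assumed, not from the article):
# P[a, s, :] is the transition distribution P_a(s, .), R[a, s, s'] the reward.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # each P[a, s, :] sums to 1
R = rng.normal(size=(n_actions, n_states, n_states))
pi = np.array([0, 1, 0])  # deterministic policy: pi[s] is the action taken in state s


def discounted_return(s0, gamma=0.95, horizon=1000):
    """Sample sum_t gamma^t R_{a_t}(s_t, s_{t+1}) with a_t = pi(s_t),
    truncated at `horizon` steps (the tail is negligible for gamma < 1)."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = pi[s]
        s_next = rng.choice(n_states, p=P[a, s])
        total += gamma**t * R[a, s, s_next]
        s = s_next
    return total


def finite_horizon_return(s0, H=20):
    """Sample sum_{t=0}^{H-1} R_{a_t}(s_t, s_{t+1}) with a_t = pi(s_t)."""
    s, total = s0, 0.0
    for _ in range(H):
        a = pi[s]
        s_next = rng.choice(n_states, p=P[a, s])
        total += R[a, s, s_next]
        s = s_next
    return total


# Averaging many sampled returns approximates the expectations in the two objectives.
print(np.mean([discounted_return(0) for _ in range(2000)]))
print(np.mean([finite_horizon_return(0) for _ in range(2000)]))
</syntaxhighlight>

Comparing such Monte Carlo estimates across different policies <math>\pi</math> is one (inefficient) way to search for an optimal policy; dynamic-programming methods discussed elsewhere in the article compute it exactly.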