==Algorithms==
Solutions for MDPs with finite state and action spaces may be found through a variety of methods such as [[dynamic programming]]. The algorithms in this section apply to MDPs with finite state and action spaces and explicitly given transition probabilities and reward functions, but the basic concepts may be extended to handle other problem classes, for example using [[function approximation]]. Also, some processes with countably infinite state and action spaces can be ''exactly'' reduced to ones with finite state and action spaces.<ref name="Wrobel 1984">{{cite journal|first=A.|last=Wrobel|title=On Markovian decision models with a finite skeleton|journal=Zeitschrift für Operations Research|date=1984|volume=28|issue=1|pages=17–27|doi=10.1007/bf01919083|s2cid=2545336}}</ref>

The standard family of algorithms to calculate optimal policies for finite state and action MDPs requires storage for two arrays indexed by state: ''value'' <math>V</math>, which contains real values, and ''policy'' <math>\pi</math>, which contains actions. At the end of the algorithm, <math>\pi</math> will contain the solution and <math>V(s)</math> will contain the discounted sum of the rewards to be earned (on average) by following that solution from state <math>s</math>.

The algorithm has two steps, (1) a value update and (2) a policy update, which are repeated in some order for all the states until no further changes take place. Both recursively update a new estimation of the optimal policy and state value using an older estimation of those values.

:<math> V(s) := \sum_{s'} P_{\pi(s)} (s,s') \left( R_{\pi(s)} (s,s') + \gamma V(s') \right) </math>
:<math> \pi(s) := \operatorname{argmax}_a \left\{ \sum_{s'} P_{a}(s, s') \left( R_{a}(s,s') + \gamma V(s') \right) \right\} </math>

The order in which the two steps are applied depends on the variant of the algorithm; they can be performed for all states at once or state by state, and more often to some states than to others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.<ref>{{Cite book|title=Reinforcement Learning: Theory and Python Implementation|publisher=China Machine Press|year=2019|isbn=9787111631774|location=Beijing|pages=44}}</ref>

===Notable variants===

====Value iteration====
In value iteration {{harv|Bellman|1957}}, which is also called [[backward induction]], the <math>\pi</math> function is not used; instead, the value of <math>\pi(s)</math> is calculated within <math>V(s)</math> whenever it is needed. Because the policy update selects the action that maximizes the bracketed expression, substituting the calculation of <math>\pi(s)</math> into the calculation of <math>V(s)</math> replaces the fixed action <math>\pi(s)</math> with a maximization over actions, giving the combined step:

:<math> V_{i+1}(s) := \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V_i(s') \right) \right\}, </math>

where <math>i</math> is the iteration number. Value iteration starts at <math>i = 0</math> with <math>V_0</math> as a guess of the [[value function]]. It then iterates, repeatedly computing <math>V_{i+1}</math> for all states <math>s</math>, until <math>V</math> converges with the left-hand side equal to the right-hand side (which is the "[[Bellman equation]]" for this problem{{clarify|date=January 2018}}).
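The combined step translates directly into code. The following is a minimal, illustrative sketch in Python with NumPy, assuming the transition probabilities and rewards are supplied as arrays <code>P[a, s, s2]</code> and <code>R[a, s, s2]</code>; the function and variable names are illustrative and not taken from any particular library:

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Illustrative value iteration sketch (not a reference implementation).

    P[a, s, s2]: probability of moving from state s to s2 under action a.
    R[a, s, s2]: reward received for that transition.
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)                   # V_0: initial guess of the value function
    while True:
        # Combined step: V_{i+1}(s) = max_a sum_{s'} P_a(s,s') (R_a(s,s') + gamma V_i(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]: expected return of action a in state s
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < theta:  # stop once V has (numerically) converged
            break
        V = V_new
    policy = Q.argmax(axis=0)                  # greedy policy extracted from the converged V
    return V_new, policy
</syntaxhighlight>

Here the threshold <code>theta</code> is a numerical tolerance standing in for exact equality of the left-hand and right-hand sides.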
[[Lloyd Shapley]]'s 1953 paper on [[stochastic games]] included as a special case the value iteration method for MDPs,<ref>{{cite journal|last=Shapley|first=Lloyd|author-link=Lloyd Shapley|title=Stochastic Games|year=1953|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=39|issue=10|pages=1095–1100|doi=10.1073/pnas.39.10.1095|pmid=16589380|pmc=1063912|bibcode=1953PNAS...39.1095S|doi-access=free}}</ref> but this was recognized only later on.<ref>{{cite book|first=Lodewijk|last=Kallenberg|chapter=Finite state and action MDPs|editor-first1=Eugene A.|editor-last1=Feinberg|editor-link1=Eugene A. Feinberg|editor-first2=Adam|editor-last2=Shwartz|title=Handbook of Markov decision processes: methods and applications|publisher=Springer|year=2002|isbn=978-0-7923-7459-6}}</ref>

====Policy iteration====
In policy iteration {{harv|Howard|1960}}, step one is repeated until the values converge for the current policy, and then step two is performed once; the two phases are alternated until the policy converges (an illustrative sketch is given below, after the variant descriptions). (Policy iteration was invented by Howard to optimize [[Sears]] catalogue mailing, which he had been optimizing using value iteration.<ref>Howard 2002, [https://pubsonline.informs.org/doi/10.1287/opre.50.1.100.17788 "Comments on the Origin and Application of Markov Decision Processes"]</ref>)

Instead of repeating step one to convergence, the value of the current policy may be found by solving a set of linear equations: requiring the step one assignment to hold with equality for every state <math>s</math> gives one linear equation in the unknowns <math>V(s)</math> per state. Thus, repeating step one to convergence can be interpreted as solving these linear equations by [[Relaxation (iterative method)|relaxation]].

This variant has the advantage that there is a definite stopping condition: when the array <math>\pi</math> does not change in the course of applying step two to all states, the algorithm is completed. Policy iteration is usually slower than value iteration for a large number of possible states.

====Modified policy iteration====
In modified policy iteration ({{harvnb|van Nunen|1976}}; {{harvnb|Puterman|Shin|1978}}), step one is repeated several times rather than to convergence, and then step two is performed once.<ref>{{cite journal|first1=M. L.|last1=Puterman|last2=Shin|first2=M. C.|title=Modified Policy Iteration Algorithms for Discounted Markov Decision Problems|journal=Management Science|volume=24|issue=11|year=1978|doi=10.1287/mnsc.24.11.1127|pages=1127–1137}}</ref><ref>{{cite journal|first=J.A. E. E|last=van Nunen|title=A set of successive approximation methods for discounted Markovian decision problems|journal=Zeitschrift für Operations Research|volume=20|issue=5|pages=203–208|year=1976|doi=10.1007/bf01920264|s2cid=5167748}}</ref> Then step one is again repeated several times, and so on.

====Prioritized sweeping====
In this variant, the steps are preferentially applied to states which are in some way important: whether based on the algorithm (there were large changes in <math>V</math> or <math>\pi</math> around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
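The following is an illustrative sketch of the policy iteration variant described above, written in the same style and with the same assumed array layout for <code>P</code> and <code>R</code> as the value iteration sketch. It evaluates each policy exactly by solving the linear equations with a standard solver rather than by relaxation; all names are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Illustrative policy iteration sketch with exact policy evaluation."""
    num_actions, num_states, _ = P.shape
    policy = np.zeros(num_states, dtype=int)    # arbitrary initial policy
    while True:
        # Step one solved exactly: V = r_pi + gamma * P_pi V, i.e. (I - gamma P_pi) V = r_pi
        P_pi = P[policy, np.arange(num_states)]                    # P_pi[s, s2] = P_{pi(s)}(s, s2)
        r_pi = (P_pi * R[policy, np.arange(num_states)]).sum(axis=1)
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
        # Step two: greedy policy update with respect to V
        Q = (P * (R + gamma * V)).sum(axis=2)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):                     # pi unchanged: stop
            return V, policy
        policy = new_policy
</syntaxhighlight>

The direct linear solve replaces many relaxation sweeps of step one, which is the main difference from the value iteration sketch above.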
===Computational complexity===
Algorithms for finding optimal policies with [[time complexity]] polynomial in the size of the problem representation exist for finite MDPs. Thus, [[decision problem]]s based on MDPs are in computational [[complexity class]] [[P (complexity)|P]].<ref>{{cite journal | last1=Papadimitriou | first1=Christos | authorlink1=Christos Papadimitriou | last2=Tsitsiklis | first2=John | authorlink2=John Tsitsiklis | date=1987 | title=The Complexity of Markov Decision Processes | url=https://pubsonline.informs.org/doi/abs/10.1287/moor.12.3.441 | journal=[[Mathematics of Operations Research]] | volume=12 | issue=3 | pages=441–450 | doi=10.1287/moor.12.3.441 | access-date=November 2, 2023| hdl=1721.1/2893 | hdl-access=free }}</ref> However, due to the [[curse of dimensionality]], the size of the problem representation is often exponential in the number of state and action variables, limiting exact solution techniques to problems that have a compact representation. In practice, online planning techniques such as [[Monte Carlo tree search]] can find useful solutions in larger problems, and, in theory, it is possible to construct online planning algorithms that can find an arbitrarily near-optimal policy with no computational complexity dependence on the size of the state space.<ref>{{cite journal|last1=Kearns|first1=Michael|last2=Mansour|first2=Yishay|last3=Ng|first3=Andrew|date=November 2002|title=A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes|journal=Machine Learning|volume=49|issue=2/3|pages=193–208|doi=10.1023/A:1017932429737|doi-access=free}}</ref>
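To illustrate how an online planner can avoid any dependence on the number of states, the following simplified sketch (in the spirit of the sparse sampling algorithm of the paper cited above, not a faithful reproduction of it) estimates action values only at the current state, drawing a fixed number of samples per action from an assumed generative model <code>simulate(state, action)</code> and recursing to a fixed depth. All function and parameter names are illustrative:

<syntaxhighlight lang="python">
def sparse_sampling_q(simulate, actions, state, depth, width, gamma=0.9):
    """Estimate Q-values at `state` by recursive sampling from a generative model.

    simulate(state, action) is assumed to return one sampled (next_state, reward) pair;
    calling it does not require enumerating the state space.
    """
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):                      # fixed number of samples per action
            next_state, reward = simulate(state, a)
            next_q = sparse_sampling_q(simulate, actions, next_state,
                                       depth - 1, width, gamma)
            total += reward + gamma * max(next_q.values())
        q[a] = total / width
    return q

def plan_action(simulate, actions, state, depth=3, width=5, gamma=0.9):
    """Choose an action at the current state by sampled lookahead."""
    q = sparse_sampling_q(simulate, actions, state, depth, width, gamma)
    return max(q, key=q.get)
</syntaxhighlight>

The cost of this lookahead grows with the number of actions, the sampling width and the depth, but not with the size of the state space, which is the essential point of the theoretical result.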