====Value iteration====
In value iteration {{harv|Bellman|1957}}, which is also called [[backward induction]], the <math>\pi</math> function is not used; instead, the value of <math>\pi(s)</math> is calculated within <math>V(s)</math> whenever it is needed. Since <math>\pi(s)</math> is the action that maximizes the expected one-step return plus the discounted value of the successor state, substituting that maximization directly into the calculation of <math>V(s)</math> gives the combined step:

:<math> V_{i+1}(s) := \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V_i(s') \right) \right\}, </math>

where <math>i</math> is the iteration number. Value iteration starts at <math>i = 0</math> with <math>V_0</math> as a guess of the [[value function]]. It then iterates, repeatedly computing <math>V_{i+1}</math> for all states <math>s</math>, until <math>V</math> converges, at which point the left-hand side equals the right-hand side (which is the "[[Bellman equation]]" for this problem).

[[Lloyd Shapley]]'s 1953 paper on [[stochastic games]] included as a special case the value iteration method for MDPs,<ref>{{cite journal|last=Shapley|first=Lloyd|author-link=Lloyd Shapley|title=Stochastic Games|year=1953|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=39|issue=10|pages=1095–1100|doi=10.1073/pnas.39.10.1095|pmid=16589380|pmc=1063912|bibcode=1953PNAS...39.1095S|doi-access=free}}</ref> but this was recognized only later on.<ref>{{cite book|first=Lodewijk|last=Kallenberg|chapter=Finite state and action MDPs|editor-first1=Eugene A.|editor-last1=Feinberg|editor-link1=Eugene A. Feinberg|editor-first2=Adam|editor-last2=Shwartz|title=Handbook of Markov decision processes: methods and applications|publisher=Springer|year=2002|isbn=978-0-7923-7459-6}}</ref>
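The update above can be implemented directly for a finite MDP. The following is a minimal sketch, assuming the transition probabilities <math>P_a(s,s')</math> and rewards <math>R_a(s,s')</math> are supplied as nested dictionaries keyed by action and state; the function and variable names are illustrative and not part of any standard library.

<syntaxhighlight lang="python">
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Approximate the optimal value function V(s) by value iteration.

    P[a][s][s2] is the probability of moving from s to s2 under action a;
    R[a][s][s2] is the corresponding reward; gamma is the discount factor.
    """
    V = {s: 0.0 for s in states}  # V_0: initial guess of the value function
    while True:
        V_new = {}
        for s in states:
            # Combined step: maximize over actions the expected immediate
            # reward plus the discounted value of the successor state.
            V_new[s] = max(
                sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
        # Stop once successive iterates agree to within the tolerance,
        # i.e. V approximately satisfies the Bellman equation.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
</syntaxhighlight>

In practice the loop is terminated when successive iterates differ by less than a chosen tolerance, since exact equality of the two sides of the Bellman equation is only reached in the limit.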