====Policy iteration====
In policy iteration {{harv|Howard|1960}}, step one is performed once, and then step two is repeated until it converges. Then step one is again performed once, and so on. (Policy iteration was invented by Howard to optimize [[Sears]] catalogue mailing, which he had been optimizing using value iteration.<ref>Howard 2002, [https://pubsonline.informs.org/doi/10.1287/opre.50.1.100.17788 "Comments on the Origin and Application of Markov Decision Processes"]</ref>)

Instead of repeating step two to convergence, it may be formulated and solved as a set of linear equations: treating the step-two assignment as an equality gives one linear equation in the unknowns <math>V(s)</math> for each state <math>s</math>. Thus, repeating step two to convergence can be interpreted as solving these linear equations by [[Relaxation (iterative method)|relaxation]].

This variant has the advantage that there is a definite stopping condition: when the array <math>\pi</math> does not change in the course of applying step one to all states, the algorithm is completed.

Policy iteration is usually slower than value iteration for a large number of possible states.
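A minimal sketch of this procedure in Python may help make the two steps concrete. It assumes hypothetical tabular arrays <code>P[a][s][t]</code> (transition probabilities), <code>R[a][s][t]</code> (rewards) and a discount factor <code>gamma</code>; these names and shapes are illustrative, not taken from the article.

<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(P, R, gamma):
    """Illustrative policy iteration for a finite MDP.

    P[a][s][t] -- probability of moving from state s to state t under action a
                  (assumed shape: actions x states x states)
    R[a][s][t] -- reward received for that transition
    gamma      -- discount factor, 0 <= gamma < 1
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy

    while True:
        # Step two as a linear system: V = r_pi + gamma * P_pi @ V,
        # solved directly instead of being repeated to convergence.
        P_pi = P[policy, np.arange(n_states)]                       # (n_states, n_states)
        r_pi = np.sum(P_pi * R[policy, np.arange(n_states)], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # Step one: greedy improvement of the policy with respect to V.
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        new_policy = np.argmax(Q, axis=0)

        # Definite stopping condition: the policy array did not change.
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
</syntaxhighlight>

Solving the linear system with <code>numpy.linalg.solve</code> stands in for repeating step two to convergence, and the loop terminates exactly when the policy array stops changing, which is the stopping condition described above.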