=== Learning automata ===
{{main|Learning automata}}
Another application of MDPs in [[machine learning]] theory is called learning automata. If the environment is stochastic, this is also a type of reinforcement learning. '''Learning automata''' were first surveyed in detail by [[Kumpati S. Narendra|Narendra]] and Thathachar (1974), who described them explicitly as [[finite-state automata]].<ref>{{Cite journal|last1=Narendra|first1=K. S.|author-link=Kumpati S. Narendra|last2=Thathachar|first2=M. A. L.|year=1974 |title=Learning Automata – A Survey|journal=IEEE Transactions on Systems, Man, and Cybernetics|volume=SMC-4|issue=4|pages=323–334|doi=10.1109/TSMC.1974.5408453|issn=0018-9472|citeseerx=10.1.1.295.2280}}</ref> Like reinforcement learning, a learning automata algorithm has the advantage of solving the problem when the probabilities or rewards are unknown. The difference between learning automata and Q-learning is that the former omits the memory of Q-values and instead updates the action probabilities directly to find the learning result. Learning automata constitute a learning scheme with a rigorous proof of convergence.<ref name="NarendraEtAl1989">{{Cite book|url=https://archive.org/details/learningautomata00nare|url-access=registration|title=Learning automata: An introduction|last1=Narendra|first1=Kumpati S.|author-link=Kumpati S. Narendra|last2=Thathachar|first2=Mandayam A. L.|year=1989|publisher=Prentice Hall|isbn=9780134855585|language=en}}</ref>

In learning automata theory, a '''stochastic automaton''' consists of:
* a set ''x'' of possible inputs,
* a set Φ = { Φ<sub>1</sub>, ..., Φ<sub>''s''</sub> } of possible internal states,
* a set α = { α<sub>1</sub>, ..., α<sub>''r''</sub> } of possible outputs, or actions, with ''r'' ≤ ''s'',
* an initial state probability vector ''p''(0) = ≪ ''p''<sub>1</sub>(0), ..., ''p<sub>s</sub>''(0) ≫,
* a [[computable function]] ''A'' which after each time step ''t'' generates ''p''(''t'' + 1) from ''p''(''t''), the current input, and the current state, and
* a function ''G'': Φ → α which generates the output at each time step.

The states of such an automaton correspond to the states of a "discrete-state discrete-parameter [[Markov process]]".{{sfn|Narendra|Thathachar|1974|loc=p.325 left}} At each time step ''t'' = 0, 1, 2, 3, ..., the automaton reads an input from its environment, updates ''p''(''t'') to ''p''(''t'' + 1) by ''A'', randomly chooses a successor state according to the probabilities ''p''(''t'' + 1), and outputs the corresponding action. The automaton's environment, in turn, reads the action and sends the next input to the automaton.<ref name="NarendraEtAl1989" />
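The following is a minimal sketch of this update loop for a variable-structure automaton, using the classical linear reward-inaction (''L''<sub>''R''–''I''</sub>) scheme as the update function ''A''. The function and parameter names (<code>l_ri_automaton</code>, <code>reward_probs</code>, <code>learning_rate</code>) are illustrative, not taken from the sources; the sketch assumes a binary environment response and identifies each state with its own action (''r'' = ''s'', ''G'' the identity), so only the action probability vector ''p''(''t'') is tracked, with no Q-values stored.

<syntaxhighlight lang="python">
import random

def l_ri_automaton(reward_probs, learning_rate=0.05, steps=20000):
    """Sketch of a linear reward-inaction (L_R-I) learning automaton.

    reward_probs: per-action success probabilities, unknown to the
    automaton itself; it only observes the binary environment response.
    """
    r = len(reward_probs)          # number of actions alpha_1 ... alpha_r
    p = [1.0 / r] * r              # p(0): uniform action probability vector

    for _ in range(steps):
        # Randomly choose an action according to the current p(t).
        a = random.choices(range(r), weights=p)[0]
        # The stochastic environment returns a binary input (True = reward).
        rewarded = random.random() < reward_probs[a]
        if rewarded:
            # Update A on reward: shift probability mass toward action a, i.e.
            # p_a(t+1) = p_a(t) + lr * (1 - p_a(t)), others scaled by (1 - lr).
            p = [(1 - learning_rate) * pi for pi in p]
            p[a] += learning_rate
        # Inaction: on a non-reward, p(t+1) = p(t) is left unchanged.

    return p

# Two-action environment with success rates unknown to the automaton;
# p converges toward selecting the better action (here, the second).
print(l_ri_automaton([0.4, 0.8]))
</syntaxhighlight>

Because the probability vector is updated only on rewards and left unchanged otherwise, the ''L''<sub>''R''–''I''</sub> scheme is one of the schemes for which the literature gives convergence guarantees of the kind mentioned above.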