=== Model-based algorithms ===
Finally, all of the above methods can be combined with algorithms that first learn a model of the [[Markov decision process]], i.e. the probability of each next state given an action taken from an existing state. For instance, the Dyna algorithm learns a model from experience and uses it to supply additional modelled transitions for updating a value function, alongside the real transitions.<ref>{{Cite conference |last1=Sutton |first1=Richard |title=Integrated Architectures for Learning, Planning and Reacting based on Dynamic Programming |year=1990 |book-title=Machine Learning: Proceedings of the Seventh International Workshop}}</ref> Such methods can sometimes be extended to the use of non-parametric models, such as when the transitions are simply stored and "replayed" to the learning algorithm.<ref>{{Cite conference |first1=Long-Ji |last1=Lin |title=Self-improving reactive agents based on reinforcement learning, planning and teaching |book-title=Machine Learning, volume 8 |year=1992 |doi=10.1007/BF00992699 |url=https://link.springer.com/content/pdf/10.1007/BF00992699.pdf}}</ref> Model-based methods can be more computationally intensive than model-free approaches, and their utility can be limited by the extent to which the Markov decision process can be learnt.<ref>{{Citation |last=Zou |first=Lan |title=Chapter 7 - Meta-reinforcement learning |date=2023-01-01 |url=https://www.sciencedirect.com/science/article/pii/B9780323899314000110 |work=Meta-Learning |pages=267–297 |editor-last=Zou |editor-first=Lan |access-date=2023-11-08 |publisher=Academic Press |doi=10.1016/b978-0-323-89931-4.00011-0 |isbn=978-0-323-89931-4}}</ref> Models can also be used in ways other than updating a value function.<ref>{{Cite conference |last1=van Hasselt |first1=Hado |last2=Hessel |first2=Matteo |last3=Aslanides |first3=John |title=When to use parametric models in reinforcement learning? |year=2019 |book-title=Advances in Neural Information Processing Systems 32 |url=https://proceedings.neurips.cc/paper/2019/file/1b742ae215adf18b75449c6e272fd92d-Paper.pdf}}</ref> For instance, in [[model predictive control]] the model is used to update the behavior directly.
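The following is a minimal sketch of a Dyna-style agent (tabular Dyna-Q) illustrating how each real transition both updates the value function and populates a learned model, which is then sampled for additional simulated "planning" updates. The environment interface (<code>reset</code>/<code>step</code>), the hyperparameter values, and all function names here are illustrative assumptions, not part of any particular library or of the original Dyna publication.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=500, alpha=0.1, gamma=0.95,
           epsilon=0.1, planning_steps=10):
    """Tabular Dyna-Q sketch: each real transition updates the value
    function directly and is also stored in a learned model, which is
    then sampled for extra simulated ("planning") updates.
    Assumes env.reset() -> state and env.step(a) -> (state', reward, done)."""
    Q = defaultdict(float)   # Q[(state, action)] -> estimated action value
    model = {}               # model[(state, action)] -> (reward, next_state, done)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            action = random.choice(actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)

            # direct reinforcement learning update from the real transition
            target = reward + (0.0 if done else gamma * Q[(next_state, greedy(next_state))])
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            # model learning: remember what this state-action pair led to
            model[(state, action)] = (reward, next_state, done)

            # planning: replay simulated transitions drawn from the learned model
            for _ in range(planning_steps):
                s, a = random.choice(list(model.keys()))
                r, s2, d = model[(s, a)]
                t = r + (0.0 if d else gamma * Q[(s2, greedy(s2))])
                Q[(s, a)] += alpha * (t - Q[(s, a)])

            state = next_state
    return Q
</syntaxhighlight>

In this sketch the model is deterministic (it stores only the most recent outcome of each state-action pair); richer model-based methods instead learn a full transition distribution or, as in model predictive control, use the model to plan actions directly rather than to update a value function.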