== Comparison of key algorithms ==
The following table lists the key algorithms for learning a policy depending on several criteria:
* The algorithm can be on-policy (it performs policy updates using trajectories sampled via the current policy)<ref>cf. {{harvnb|Sutton|Barto|2018|loc=Section 5.4, p. 100}}</ref> or off-policy.
* The action space may be discrete (e.g. the action space could be "going up", "going left", "going right", "going down", "stay") or continuous (e.g. moving the arm with a given angle).
* The state space may be discrete (e.g. the agent could be in a cell in a grid) or continuous (e.g. the agent could be located at a given position in the plane).

{| class="wikitable sortable"
|-
! Algorithm !! Description !! Policy !! Action space !! State space !! Operator
|-
| [[Monte Carlo method|Monte Carlo]] || Every-visit Monte Carlo || Either || Discrete || Discrete || Sample-means of state-values or action-values
|-
| [[Temporal difference learning|TD learning]] || State–action–reward–state || Off-policy || Discrete || Discrete || State-value
|-
| [[Q-learning]] || State–action–reward–state || Off-policy || Discrete || Discrete || Action-value
|-
| [[State–action–reward–state–action|SARSA]] || State–action–reward–state–action || On-policy || Discrete || Discrete || Action-value
|-
| [[Q-learning#Deep Q-learning|DQN]] || Deep Q Network || Off-policy || Discrete || Continuous || Action-value
|-
| DDPG || Deep Deterministic Policy Gradient || Off-policy || Continuous || Continuous || Action-value
|-
| A3C || Asynchronous Advantage Actor-Critic Algorithm || On-policy || Discrete || Continuous || Advantage (= action-value − state-value)
|-
| TRPO || Trust Region Policy Optimization || On-policy || Continuous or Discrete || Continuous || Advantage
|-
| [[Proximal Policy Optimization|PPO]] || Proximal Policy Optimization || On-policy || Continuous or Discrete || Continuous || Advantage
|-
| TD3 || Twin Delayed Deep Deterministic Policy Gradient || Off-policy || Continuous || Continuous || Action-value
|-
| SAC || Soft Actor-Critic || Off-policy || Continuous || Continuous || Advantage
|-
| [[Distributional Soft Actor Critic|DSAC]]<ref>{{cite journal|author1=J Duan |author2=Y Guan| author3=S Li| title= Distributional Soft Actor-Critic: Off-policy reinforcement learning for addressing value estimation errors| journal= IEEE Transactions on Neural Networks and Learning Systems |volume=33 | issue=11 |year= 2021 |pages= 6584–6598 |doi=10.1109/TNNLS.2021.3082568 |pmid=34101599 |arxiv=2001.02811 |s2cid=211259373 |url= https://ieeexplore.ieee.org/document/9448360 }}</ref><ref>{{cite book|author1=Y Ren |author2=J Duan| author3=S Li|title=2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC) |chapter=Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic |year= 2020 |pages=1–6 |doi=10.1109/ITSC45102.2020.9294300 |arxiv=2002.05502 |isbn=978-1-7281-4149-7 |s2cid=211096594 |chapter-url= https://ieeexplore.ieee.org/document/9294300 }}</ref><ref>{{cite journal |last1=Duan |first1=J |last2=Wang |first2=W |last3=Xiao |first3=L |date=2025 |title=Distributional Soft Actor-Critic with Three Refinements |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=PP |issue=5 |pages=3935–3946 |doi=10.1109/TPAMI.2025.3537087 |pmid=40031258 |arxiv=2310.05858 }}</ref> || Distributional Soft Actor Critic || Off-policy || Continuous || Continuous || Action-value distribution
|}
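The on-policy/off-policy distinction in the table can be made concrete by comparing the tabular SARSA and Q-learning updates. The following sketch is illustrative only: the ε-greedy helper, the step size <code>alpha</code> and the discount <code>gamma</code> are assumed conventions rather than part of any particular implementation.

<syntaxhighlight lang="python">
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Behaviour policy: random action with probability eps, otherwise greedy."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstraps on the action a_next actually chosen by the current policy.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstraps on the greedy action, regardless of the behaviour policy.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
</syntaxhighlight>

Both updates operate on a table <code>Q</code> indexed by discrete states and actions, matching the "Discrete/Discrete" rows of the table above.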
=== Associative reinforcement learning ===
Associative reinforcement learning tasks combine facets of stochastic learning automata tasks and supervised learning pattern classification tasks. In associative reinforcement learning tasks, the learning system interacts in a closed loop with its environment.<ref>{{cite book |last1=Soucek |first1=Branko |title=Dynamic, Genetic and Chaotic Programming: The Sixth-Generation Computer Technology Series |date=6 May 1992 |publisher=John Wiley & Sons, Inc |isbn=0-471-55717-X |page=38}}</ref>

=== Deep reinforcement learning ===
This approach extends reinforcement learning by using a deep neural network, without explicitly designing the state space.<ref name="intro_deep_RL">{{cite journal |first= Vincent|display-authors=etal|last= Francois-Lavet |year=2018 |title= An Introduction to Deep Reinforcement Learning |journal=Foundations and Trends in Machine Learning|volume=11 |issue=3–4 |pages=219–354 |doi=10.1561/2200000071|arxiv= 1811.12560 |bibcode=2018arXiv181112560F|s2cid=54434537}}</ref> The work on learning ATARI games by Google [[DeepMind]] increased attention to [[deep reinforcement learning]] or [[end-to-end reinforcement learning]].<ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid= 25719670 |bibcode=2015Natur.518..529M |s2cid=205242740}}</ref>

=== Adversarial deep reinforcement learning ===
Adversarial deep reinforcement learning is an active area of research in reinforcement learning focusing on vulnerabilities of learned policies. Early studies in this area showed that reinforcement learning policies are susceptible to imperceptible adversarial manipulations.<ref>{{cite journal |last1= Goodfellow|first1=Ian |last2=Shlens |first2= Jonathan|last3=Szegedy|first3=Christian|title= Explaining and Harnessing Adversarial Examples |journal= International Conference on Learning Representations |date= 2015 |arxiv=1412.6572 }}</ref><ref>{{cite book |last1= Behzadan|first1=Vahid |last2=Munir |first2= Arslan|title=Machine Learning and Data Mining in Pattern Recognition |chapter=Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks |series=Lecture Notes in Computer Science |date= 2017 |volume=10358 |pages=262–275 |doi=10.1007/978-3-319-62416-7_19 |arxiv=1701.04143|isbn=978-3-319-62415-0 |s2cid=1562290 }}</ref><ref>{{Cite book |last1=Huang |first1=Sandy |last2=Papernot |first2=Nicolas |last3=Goodfellow |first3=Ian |last4=Duan |first4=Yan |last5=Abbeel |first5=Pieter |url=http://worldcat.org/oclc/1106256905 |title=Adversarial Attacks on Neural Network Policies |date=2017-02-07 |oclc=1106256905}}</ref> While some methods have been proposed to overcome these susceptibilities, more recent studies show that the proposed solutions are far from providing an accurate representation of the current vulnerabilities of deep reinforcement learning policies.<ref>{{cite journal |last1=Korkmaz |first1=Ezgi |date=2022 |title=Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs. |journal=Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22) |volume=36 |issue=7 |pages=7229–7238 |doi=10.1609/aaai.v36i7.20684 |arxiv=2112.09025|s2cid=245219157 |doi-access=free }}</ref>

=== Fuzzy reinforcement learning ===
By introducing [[Fuzzy control system|fuzzy inference]] in reinforcement learning,<ref>{{Cite book |last=Berenji |first=H.R. |title=Proceedings of 1994 IEEE 3rd International Fuzzy Systems Conference |chapter=Fuzzy Q-learning: A new approach for fuzzy dynamic programming |date=1994 |chapter-url=https://ieeexplore.ieee.org/document/343737 |location=Orlando, FL, USA |publisher=IEEE |pages=486–491 |doi=10.1109/FUZZY.1994.343737|isbn=0-7803-1896-X |s2cid=56694947 }}</ref> approximating the state-action value function with [[fuzzy rule]]s in continuous space becomes possible. The IF–THEN form of fuzzy rules makes this approach suitable for expressing the results in a form close to natural language. Extending fuzzy reinforcement learning (FRL) with fuzzy rule interpolation<ref>{{Cite book |last=Vincze |first=David |title=2017 IEEE 15th International Symposium on Applied Machine Intelligence and Informatics (SAMI) |date=2017 |chapter=Fuzzy rule interpolation and reinforcement learning |chapter-url=http://users.iit.uni-miskolc.hu/~vinczed/research/vinczed_sami2017_author_draft.pdf |publisher=IEEE |pages=173–178 |doi=10.1109/SAMI.2017.7880298|isbn=978-1-5090-5655-2 |s2cid=17590120 }}</ref> allows the use of reduced-size sparse fuzzy rule bases to emphasize cardinal rules (the most important state-action values).
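A minimal sketch of the idea of approximating action-values with fuzzy rules over a continuous state: each rule has a membership (firing-strength) function and stores one q-value per discrete action, and a temporal-difference update is distributed over the rules in proportion to their firing strengths. The triangular membership functions, the one-dimensional state and the greedy bootstrap below are illustrative assumptions and do not reproduce the exact algorithms of the cited works.

<syntaxhighlight lang="python">
import numpy as np

class FuzzyQ:
    """Action values Q(s, a) ~= sum_i phi_i(s) * q[i, a], where phi are the
    normalized firing strengths of triangular fuzzy membership functions."""

    def __init__(self, centers, n_actions, width=1.0):
        self.centers = np.asarray(centers, dtype=float)  # one rule center per fuzzy set
        self.width = width
        self.q = np.zeros((len(centers), n_actions))     # one q-value per (rule, action)

    def strengths(self, s):
        # Triangular memberships for a one-dimensional state, normalized to sum to 1.
        mu = np.maximum(0.0, 1.0 - np.abs(s - self.centers) / self.width)
        return mu / (mu.sum() + 1e-12)

    def value(self, s, a):
        return float(self.strengths(s) @ self.q[:, a])

    def update(self, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # TD-style update, distributed over rules in proportion to their firing strength.
        best_next = max(self.value(s_next, b) for b in range(self.q.shape[1]))
        td_error = r + gamma * best_next - self.value(s, a)
        self.q[:, a] += alpha * td_error * self.strengths(s)
</syntaxhighlight>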
=== Inverse reinforcement learning ===
In inverse reinforcement learning (IRL), no reward function is given. Instead, the reward function is inferred from the observed behavior of an expert. The idea is to mimic the observed behavior, which is often optimal or close to optimal.<ref>{{cite book |last1=Ng |first1=A. Y. |last2=Russell |first2=S. J. |year=2000 |chapter=Algorithms for Inverse Reinforcement Learning |title=Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning |pages=663–670 |publisher=Morgan Kaufmann Publishers |isbn=1-55860-707-2 |chapter-url=https://ai.stanford.edu/~ang/papers/icml00-irl.pdf }}</ref> One popular IRL paradigm is named maximum entropy inverse reinforcement learning (MaxEnt IRL).<ref>{{Cite journal |last1=Ziebart |first1=Brian D. |last2=Maas |first2=Andrew |last3=Bagnell |first3=J. Andrew |last4=Dey |first4=Anind K. |date=2008-07-13 |title=Maximum entropy inverse reinforcement learning |url=https://dl.acm.org/doi/10.5555/1620270.1620297 |journal=Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3 |series=AAAI'08 |location=Chicago, Illinois |publisher=AAAI Press |pages=1433–1438 |isbn=978-1-57735-368-3 |s2cid=336219}}</ref> MaxEnt IRL estimates the parameters of a linear model of the reward function by maximizing the entropy of the probability distribution of observed trajectories, subject to constraints that match expected feature counts. It has recently been shown that MaxEnt IRL is a particular case of a more general framework named random utility inverse reinforcement learning (RU-IRL).<ref>{{Cite journal |last1=Pitombeira-Neto |first1=Anselmo R. |last2=Santos |first2=Helano P. |last3=Coelho da Silva |first3=Ticiana L. |last4=de Macedo |first4=José Antonio F. |date=March 2024 |title=Trajectory modeling via random utility inverse reinforcement learning |url=https://doi.org/10.1016/j.ins.2024.120128 |journal=Information Sciences |volume=660 |pages=120128 |doi=10.1016/j.ins.2024.120128 |issn=0020-0255 |s2cid=235187141|arxiv=2105.12092 }}</ref> RU-IRL is based on [[Random utility model|random utility theory]] and Markov decision processes. While prior IRL approaches assume that the apparent random behavior of an observed agent is due to its following a random policy, RU-IRL assumes that the observed agent follows a deterministic policy and that the apparent randomness arises because the observer has only partial access to the features the agent uses in decision making. The utility function is modeled as a random variable to account for the observer's ignorance of the features the observed agent actually considers in its utility function.
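For a linear reward model <math>r_\theta(s) = \theta^\top \phi(s)</math>, the MaxEnt IRL gradient reduces to the difference between the expert's empirical feature expectations and the feature expectations induced by the current reward estimate. The sketch below shows only this outer gradient step; the function computing the learner's feature expectations is a placeholder (in practice it requires a soft value-iteration or similar inner loop), and all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def empirical_feature_counts(trajectories, phi):
    """Average feature counts over trajectories; phi maps a state to a feature vector."""
    return np.mean([np.sum([phi(s) for s in traj], axis=0) for traj in trajectories], axis=0)

def maxent_irl_step(theta, expert_trajs, phi, learner_feature_counts, lr=0.01):
    # Gradient of the MaxEnt log-likelihood: expert feature counts minus the
    # feature counts expected under the soft-optimal policy for the current theta.
    mu_expert = empirical_feature_counts(expert_trajs, phi)
    mu_learner = learner_feature_counts(theta)  # placeholder: needs an inner planning loop
    return theta + lr * (mu_expert - mu_learner)
</syntaxhighlight>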
=== Multi-objective reinforcement learning ===
Multi-objective reinforcement learning (MORL) is a form of reinforcement learning concerned with optimizing several, possibly conflicting, objectives. It is distinct from multi-objective optimization in that it is concerned with agents acting in environments.<ref>{{cite journal |vauthors=Hayes C, Radulescu R, Bargiacchi E, et al |date= 2022 |title=A practical guide to multi-objective reinforcement learning and planning|journal= Autonomous Agents and Multi-Agent Systems |volume= 36|doi= 10.1007/s10458-022-09552-y|s2cid= 254235920 |doi-access= free|arxiv= 2103.09568}}</ref><ref>{{cite book |title=Multiple Attribute Decision Making: Methods and Applications |edition=1st |first1=Gwo-Hshiung |last1=Tzeng |first2=Jih-Jeng |last2=Huang |date=2011 |publisher=CRC Press |isbn=9781439861578}}</ref>

=== Safe reinforcement learning ===
Safe reinforcement learning (SRL) can be defined as the process of learning policies that maximize the expected return in problems where it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.<ref>{{cite journal |last1=García |first1=Javier |last2=Fernández |first2=Fernando |title=A comprehensive survey on safe reinforcement learning |url=https://jmlr.org/papers/volume16/garcia15a/garcia15a.pdf |journal=The Journal of Machine Learning Research |date=1 January 2015 |volume=16 |issue=1 |pages=1437–1480 }}</ref> An alternative approach is risk-averse reinforcement learning, where instead of the ''expected'' return, a ''risk measure'' of the return is optimized, such as the [[Expected shortfall|conditional value at risk]] (CVaR).<ref>{{Cite journal |last1=Dabney |first1=Will |last2=Ostrovski |first2=Georg |last3=Silver |first3=David |last4=Munos |first4=Remi |date=2018-07-03 |title=Implicit Quantile Networks for Distributional Reinforcement Learning |url=https://proceedings.mlr.press/v80/dabney18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1096–1105|arxiv=1806.06923 }}</ref> In addition to mitigating risk, the CVaR objective increases robustness to model uncertainties.<ref>{{Cite journal |last1=Chow |first1=Yinlam |last2=Tamar |first2=Aviv |last3=Mannor |first3=Shie |last4=Pavone |first4=Marco |date=2015 |title=Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach |url=https://proceedings.neurips.cc/paper/2015/hash/64223ccf70bbb65a3a4aceac37e21016-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28|arxiv=1506.02188 }}</ref><ref>{{Cite web |title=Train Hard, Fight Easy: Robust Meta Reinforcement Learning |url=https://scholar.google.com/citations?view_op=view_citation&hl=en&user=LnwyFkkAAAAJ&citation_for_view=LnwyFkkAAAAJ:eQOLeE2rZwMC |access-date=2024-06-21 |website=scholar.google.com}}</ref> However, CVaR optimization in risk-averse RL requires special care to prevent gradient bias<ref>{{Cite journal |last1=Tamar |first1=Aviv |last2=Glassner |first2=Yonatan |last3=Mannor |first3=Shie |date=2015-02-21 |title=Optimizing the CVaR via Sampling |url=https://ojs.aaai.org/index.php/AAAI/article/view/9561 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=29 |issue=1 |doi=10.1609/aaai.v29i1.9561 |issn=2374-3468|arxiv=1404.3862 }}</ref> and blindness to success.<ref>{{Cite journal |last1=Greenberg |first1=Ido |last2=Chow |first2=Yinlam |last3=Ghavamzadeh |first3=Mohammad |last4=Mannor |first4=Shie |date=2022-12-06 |title=Efficient Risk-Averse Reinforcement Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/d2511dfb731fa336739782ba825cd98c-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=32639–32652|arxiv=2205.05138 }}</ref>
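As a concrete illustration of such a risk measure, the empirical conditional value at risk at level α of a batch of sampled returns is the mean of the worst α-fraction of them; a risk-averse agent optimizes this quantity instead of the plain average. The helper below is a sketch and is not drawn from the cited works.

<syntaxhighlight lang="python">
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    returns = np.sort(np.asarray(returns, dtype=float))   # ascending: worst returns first
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

# A risk-averse agent prefers the policy with the higher CVaR,
# even if its average return is slightly lower.
print(empirical_cvar([10.0, 12.0, 11.0, -50.0, 9.0], alpha=0.2))  # -> -50.0
</syntaxhighlight>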
=== Self-reinforcement learning ===
Self-reinforcement learning (or self-learning) is a learning paradigm that does not use the concept of immediate reward <math>R_a(s,s')</math> after the transition from <math>s</math> to <math>s'</math> with action <math>a</math>. It does not use external reinforcement; it uses only the agent's internal self-reinforcement. The internal self-reinforcement is provided by a mechanism of feelings and emotions. In the learning process, emotions are backpropagated by a mechanism of secondary reinforcement. The learning equation does not include the immediate reward; it includes only the state evaluation.

The self-reinforcement algorithm updates a memory matrix <math>W=||w(a,s)||</math> such that each iteration executes the following routine:
# In situation <math>s</math> perform action <math>a</math>.
# Receive a consequence situation <math>s'</math>.
# Compute the state evaluation <math>v(s')</math> of how good it is to be in the consequence situation <math>s'</math>.
# Update the crossbar memory <math>w'(a,s) = w(a,s) + v(s')</math>.

Initial conditions of the memory are received as input from the genetic environment. It is a system with only one input (situation) and only one output (action, or behavior). Self-reinforcement (self-learning) was introduced in 1982 along with a neural network capable of self-reinforcement learning, named Crossbar Adaptive Array (CAA).<ref>Bozinovski, S. (1982). "A self-learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North-Holland. pp. 397–402. ISBN 978-0-444-86488-8</ref><ref>Bozinovski S. (1995) "Neuro genetic agents and structural theory of self-reinforcement learning systems". CMPSCI Technical Report 95-107, University of Massachusetts at Amherst [https://web.cs.umass.edu/publication/docs/1995/UM-CS-1995-107.pdf]</ref> The CAA computes, in a crossbar fashion, both decisions about actions and emotions (feelings) about consequence states. The system is driven by the interaction between cognition and emotion.<ref>Bozinovski, S. (2014) "Modeling mechanisms of cognition-emotion interaction in artificial neural networks, since 1981." Procedia Computer Science p. 255–263</ref>
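A minimal sketch of the crossbar update routine listed above, assuming a discrete situation and action space; the state-evaluation function <code>v</code>, the greedy action selection and the environment interface are illustrative assumptions, since they are not fully specified here.

<syntaxhighlight lang="python">
import numpy as np

def crossbar_self_learning(env, n_actions, n_states, v, episodes=100):
    """Crossbar-style self-reinforcement sketch: w(a, s) is updated with the
    evaluation v(s') of the consequence situation, with no external reward signal."""
    w = np.zeros((n_actions, n_states))  # crossbar memory; placeholder for genetically given initial conditions
    for _ in range(episodes):
        s = env.reset()                  # assumed interface: returns a situation index
        done = False
        while not done:
            a = int(np.argmax(w[:, s]))  # act according to the current memory (assumed selection rule)
            s_next, done = env.step(a)   # consequence situation; no reward is used
            w[a, s] += v(s_next)         # update crossbar memory with the state evaluation
            s = s_next
    return w
</syntaxhighlight>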
=== Reinforcement learning in natural language processing ===
In recent years, reinforcement learning has become a significant concept in [[Natural language processing|natural language processing (NLP)]], where tasks are often sequential decision-making problems rather than static classification problems. In reinforcement learning, an agent takes actions in an environment to maximize the accumulated reward. This framework is well suited to many NLP tasks, including dialogue generation, text summarization, and machine translation, where the quality of the output depends on optimizing long-term or human-centered goals rather than predicting a single correct label.

Early applications of RL in NLP emerged in dialogue systems, where conversation was framed as a series of actions optimized for fluency and coherence. These early attempts, including policy gradient and sequence-level training techniques, laid a foundation for the broader application of reinforcement learning to other areas of NLP.

A major breakthrough came with the introduction of [[Reinforcement learning from human feedback|reinforcement learning from human feedback (RLHF)]], a method in which human feedback is used to train a reward model that guides the RL agent. Unlike traditional rule-based or supervised systems, RLHF allows models to align their behavior with human judgments on complex and subjective tasks. This technique was initially used in the development of [[InstructGPT]], a language model trained to follow human instructions, and later in [[ChatGPT]], which incorporates RLHF to improve output responses and ensure safety. More recently, researchers have explored the use of offline RL in NLP to improve dialogue systems without the need for live human interaction. These methods optimize for user engagement, coherence, and diversity based on past conversation logs and pre-trained reward models.
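As an illustration of the reward-modelling step in RLHF, a common choice is a pairwise (Bradley–Terry-style) loss that trains the reward model to score the human-preferred response above the rejected one. The sketch below assumes a generic <code>reward_model</code> returning scalar scores and is not the training code of any particular system.

<syntaxhighlight lang="python">
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley–Terry-style loss: push the score of the human-preferred response
    above the score of the rejected one."""
    r_chosen = reward_model(prompt, chosen)      # scalar score(s); assumed interface
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
</syntaxhighlight>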