== Principles ==
Due to its generality, reinforcement learning is studied in many disciplines, such as [[game theory]], [[control theory]], [[operations research]], [[information theory]], [[simulation-based optimization]], [[multi-agent system]]s, [[swarm intelligence]], and [[statistics]]. In the operations research and control literature, RL is called ''approximate dynamic programming'' or ''neuro-dynamic programming''. The problems of interest in RL have also been studied in the [[optimal control theory|theory of optimal control]], which is concerned mostly with the existence and characterization of optimal solutions and with algorithms for their exact computation, and less with learning or approximation (particularly in the absence of a mathematical model of the environment).

Basic reinforcement learning is modeled as a [[Markov decision process]]:
* A set of environment and agent states (the state space), <math>\mathcal{S}</math>;
* A set of actions (the action space), <math>\mathcal{A}</math>, of the agent;
* <math>P_a(s,s')=\Pr(S_{t+1}=s'\mid S_t=s, A_t=a)</math>, the transition probability (at time <math>t</math>) from state <math>s</math> to state <math>s'</math> under action <math>a</math>;
* <math>R_a(s,s')</math>, the immediate reward after the transition from <math>s</math> to <math>s'</math> under action <math>a</math>.

The purpose of reinforcement learning is for the agent to learn an optimal (or near-optimal) policy that maximizes the reward function or other user-provided reinforcement signal that accumulates from the immediate rewards. This is similar to [[Reinforcement|processes]] that appear to occur in animal psychology. For example, biological brains are hardwired to interpret signals such as pain and hunger as negative reinforcements, and to interpret pleasure and food intake as positive reinforcements. In some circumstances, animals learn to adopt behaviors that optimize these rewards. This suggests that animals are capable of reinforcement learning.<ref>{{cite book |last1=Russell |first1=Stuart J. |last2=Norvig |first2=Peter |title=Artificial intelligence: a modern approach |date=2010 |location=Upper Saddle River, New Jersey |publisher=[[Prentice Hall]] |isbn=978-0-13-604259-4 |pages=830, 831 |edition=Third}}</ref><ref>{{cite journal |last1=Lee |first1=Daeyeol |last2=Seo |first2=Hyojung |last3=Jung |first3=Min Whan |title=Neural Basis of Reinforcement Learning and Decision Making |journal=Annual Review of Neuroscience |date=21 July 2012 |volume=35 |issue=1 |pages=287–308 |doi=10.1146/annurev-neuro-062111-150512 |pmid=22462543 |pmc=3490621}}</ref>

A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time step {{mvar|t}}, the agent receives the current state <math>S_t</math> and reward <math>R_t</math>. It then chooses an action <math>A_t</math> from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state <math>S_{t+1}</math>, and the reward <math>R_{t+1}</math> associated with the ''transition'' <math>(S_t,A_t,S_{t+1})</math> is determined. The goal of a reinforcement learning agent is to learn a ''policy'' <math>\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]</math>, <math>\pi(s,a) = \Pr(A_t = a\mid S_t =s)</math>, that maximizes the expected cumulative reward.

Formulating the problem as a Markov decision process assumes the agent directly observes the current environmental state; in this case, the problem is said to have ''full observability''.
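As a minimal sketch of this formulation, the agent–environment loop can be written out directly; the two-state MDP, its transition probabilities, and its reward values below are hypothetical and chosen only for illustration:

<syntaxhighlight lang="python">
import random

# Hypothetical two-state, two-action MDP used only to illustrate the
# formalism above; the numbers are arbitrary.
states = ["s0", "s1"]                      # state space S
actions = ["stay", "move"]                 # action space A

# P[a][s][s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a)
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s'] = immediate reward for the transition (s, a, s')
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
    "move": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 5.0, "s1": 0.0}},
}

def step(s, a):
    """Sample the next state S_{t+1} and reward R_{t+1} for (s, a)."""
    next_states = list(P[a][s].keys())
    probs = list(P[a][s].values())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[a][s][s_next]

def pi(s):
    """A stochastic policy pi(s, a) = Pr(A_t = a | S_t = s); uniform here."""
    return random.choice(actions)

# Agent–environment interaction in discrete time steps.
s, total_reward = "s0", 0.0
for t in range(100):
    a = pi(s)              # agent chooses an action
    s, r = step(s, a)      # environment returns next state and reward
    total_reward += r      # accumulate the reinforcement signal
print(total_reward)
</syntaxhighlight>

The nested dictionaries mirror the notation <math>P_a(s,s')</math> and <math>R_a(s,s')</math>; a learning algorithm would additionally update the policy <math>\pi</math> from the observed transitions rather than keeping it fixed, as is done in this sketch.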
If the agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have ''partial observability'', and formally the problem must be formulated as a [[partially observable Markov decision process]]. In both cases, the set of actions available to the agent can be restricted. For example, the state of an account balance could be restricted to be positive; if the current value of the state is 3 and the state transition attempts to reduce the value by 4, the transition will not be allowed.

When the agent's performance is compared to that of an agent that acts optimally, the difference in performance yields the notion of [[Regret (decision theory)|regret]]. In order to act near optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future rewards), although the immediate reward associated with this might be negative. Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including [[energy storage]],<ref>{{cite journal | doi=10.1016/j.epsr.2022.108515 | title=Community energy storage operation via reinforcement learning with eligibility traces | date=2022 | last1=Salazar Duque | first1=Edgar Mauricio | last2=Giraldo | first2=Juan S. | last3=Vergara | first3=Pedro P. | last4=Nguyen | first4=Phuong | last5=Van Der Molen | first5=Anne | last6=Slootweg | first6=Han | journal=Electric Power Systems Research | volume=212 | s2cid=250635151 | doi-access=free | bibcode=2022EPSR..21208515S }}</ref> [[robot control]],<ref>{{cite arXiv | eprint=2005.04323 | last1=Xie | first1=Zhaoming | author2=Hung Yu Ling | author3=Nam Hee Kim | author4=Michiel van de Panne | title=ALLSTEPS: Curriculum-driven Learning of Stepping Stone Skills | date=2020 | class=cs.GR }}</ref> [[Photovoltaic system|photovoltaic generators]],<ref>{{cite journal | doi=10.1016/j.ijepes.2021.107628 | title=Optimal dispatch of PV inverters in unbalanced distribution systems using Reinforcement Learning | date=2022 | last1=Vergara | first1=Pedro P. | last2=Salazar | first2=Mauricio | last3=Giraldo | first3=Juan S. | last4=Palensky | first4=Peter | journal=International Journal of Electrical Power & Energy Systems | volume=136 | s2cid=244099841 | doi-access=free | bibcode=2022IJEPE.13607628V }}</ref> [[backgammon]], [[checkers]],{{Sfn|Sutton|Barto|2018|p=|loc=Chapter 11}} [[Go (game)|Go]] ([[AlphaGo]]), and [[Self-driving car|autonomous driving systems]].<ref name="Ren-2022">{{cite journal | url=https://ieeexplore.ieee.org/document/9857655 | title=Self-Learned Intelligence for Integrated Decision and Control of Automated Vehicles at Signalized Intersections | date=2022 | doi=10.1109/TITS.2022.3196167 | last1=Ren | first1=Yangang | last2=Jiang | first2=Jianhua | last3=Zhan | first3=Guojian | last4=Li | first4=Shengbo Eben | last5=Chen | first5=Chen | last6=Li | first6=Keqiang | last7=Duan | first7=Jingliang | journal=IEEE Transactions on Intelligent Transportation Systems | volume=23 | issue=12 | pages=24145–24156 | arxiv=2110.12359 }}</ref>
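The long-term versus short-term trade-off described above can be made concrete with a small hypothetical example (the chain environment, reward values, and step counts below are illustrative only): a policy that repeatedly collects a small positive immediate reward can accumulate less total reward than one that accepts small negative rewards on the way to a larger, delayed reward.

<syntaxhighlight lang="python">
# Hypothetical 5-state chain: action "left" returns to state 0 with a small
# immediate reward, while action "right" costs -0.1 per step but pays +10 on
# reaching the final state. The numbers are arbitrary.
N, STEPS = 5, 100

def step(s, a):
    if a == "left":
        return 0, 1.0                       # small immediate reward
    s_next = min(s + 1, N - 1)
    reward = 10.0 if s_next == N - 1 else -0.1
    return (0 if s_next == N - 1 else s_next), reward

def run(policy):
    """Total (undiscounted) reward accumulated over STEPS time steps."""
    s, total = 0, 0.0
    for _ in range(STEPS):
        s, r = step(s, policy(s))
        total += r
    return total

def shortsighted(s):
    return "left"                           # always takes the +1.0 reward now

def farsighted(s):
    return "right"                          # accepts -0.1 now for +10 later

print(run(shortsighted))   # 100 * 1.0 = 100.0
print(run(farsighted))     # per 4 steps: 3*(-0.1) + 10 = 9.7, so 25 * 9.7 = 242.5
</syntaxhighlight>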
Two elements make reinforcement learning powerful: the use of samples to optimize performance, and the use of [[Neural network (machine learning)|function approximation]] to deal with large environments. Thanks to these two key components, RL can be used in large environments in the following situations:
* A model of the environment is known, but an [[Closed-form expression|analytic solution]] is not available;
* Only a simulation model of the environment is given (the subject of [[simulation-based optimization]]);<ref>{{cite book|url = https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7|title = Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement|last = Gosavi|first = Abhijit|publisher = Springer|year = 2003|isbn = 978-1-4020-7454-7|author-link = Abhijit Gosavi|series = Operations Research/Computer Science Interfaces Series}}</ref>
* The only way to collect information about the environment is to interact with it.

The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered a genuine learning problem. However, reinforcement learning converts both planning problems to [[machine learning]] problems.