=== Safe reinforcement learning ===
Safe reinforcement learning (SRL) can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.<ref>{{cite journal |last1=García |first1=Javier |last2=Fernández |first2=Fernando |title=A comprehensive survey on safe reinforcement learning |url=https://jmlr.org/papers/volume16/garcia15a/garcia15a.pdf |journal=The Journal of Machine Learning Research |date=1 January 2015 |volume=16 |issue=1 |pages=1437–1480 }}</ref> An alternative approach is risk-averse reinforcement learning, where instead of the ''expected'' return, a ''risk measure'' of the return is optimized, such as the [[Expected shortfall|conditional value at risk]] (CVaR).<ref>{{Cite journal |last1=Dabney |first1=Will |last2=Ostrovski |first2=Georg |last3=Silver |first3=David |last4=Munos |first4=Remi |date=2018-07-03 |title=Implicit Quantile Networks for Distributional Reinforcement Learning |url=https://proceedings.mlr.press/v80/dabney18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1096–1105|arxiv=1806.06923 }}</ref> In addition to mitigating risk, the CVaR objective increases robustness to model uncertainties.<ref>{{Cite journal |last1=Chow |first1=Yinlam |last2=Tamar |first2=Aviv |last3=Mannor |first3=Shie |last4=Pavone |first4=Marco |date=2015 |title=Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach |url=https://proceedings.neurips.cc/paper/2015/hash/64223ccf70bbb65a3a4aceac37e21016-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28|arxiv=1506.02188 }}</ref><ref>{{Cite web |title=Train Hard, Fight Easy: Robust Meta Reinforcement Learning |url=https://scholar.google.com/citations?view_op=view_citation&hl=en&user=LnwyFkkAAAAJ&citation_for_view=LnwyFkkAAAAJ:eQOLeE2rZwMC |access-date=2024-06-21 |website=scholar.google.com}}</ref> However, CVaR optimization in risk-averse RL requires special care to prevent gradient bias<ref>{{Cite journal |last1=Tamar |first1=Aviv |last2=Glassner |first2=Yonatan |last3=Mannor |first3=Shie |date=2015-02-21 |title=Optimizing the CVaR via Sampling |url=https://ojs.aaai.org/index.php/AAAI/article/view/9561 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=29 |issue=1 |doi=10.1609/aaai.v29i1.9561 |issn=2374-3468|arxiv=1404.3862 }}</ref> and blindness to success.<ref>{{Cite journal |last1=Greenberg |first1=Ido |last2=Chow |first2=Yinlam |last3=Ghavamzadeh |first3=Mohammad |last4=Mannor |first4=Shie |date=2022-12-06 |title=Efficient Risk-Averse Reinforcement Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/d2511dfb731fa336739782ba825cd98c-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=32639–32652|arxiv=2205.05138 }}</ref>
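As a rough illustrative sketch (not taken from any of the cited works), the CVaR objective at risk level α can be estimated empirically as the average of the worst α-fraction of sampled episode returns; the function and parameter names below are hypothetical choices for illustration only.

<syntaxhighlight lang="python">
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Empirical conditional value at risk: mean of the worst alpha-fraction of returns.

    ``returns`` is a 1-D array of sampled episode returns and ``alpha`` is the risk
    level (e.g. 0.1 keeps the worst 10% of outcomes). Both names are illustrative.
    """
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst outcomes first
    k = max(1, int(np.ceil(alpha * len(returns))))       # number of tail samples to average
    return returns[:k].mean()                            # mean of the lower alpha-tail

# Example: compare the risk-neutral objective (mean return) with the CVaR objective
# on hypothetical episode returns drawn from a normal distribution.
rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=1.0, scale=2.0, size=10_000)
print("mean return:", sampled_returns.mean())
print("CVaR (10%):", empirical_cvar(sampled_returns, alpha=0.1))
</syntaxhighlight>

A risk-averse learner would then update its policy using mainly the trajectories in this lower tail rather than the full batch, which is where the gradient-bias and blindness-to-success issues noted above arise.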