== History ==
''Q''-learning was introduced by Chris Watkins in 1989.<ref>{{cite thesis|type=Ph.D. thesis|last=Watkins|first=C.J.C.H.|year=1989|title=Learning from Delayed Rewards|publisher=[[University of Cambridge]]|url=http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf|id={{EThOS|uk.bl.ethos.330022}}}}</ref> A convergence proof was presented by Watkins and [[Peter Dayan]] in 1992.<ref>{{cite journal |last1=Watkins |first1=Chris |last2=Dayan |first2=Peter |year=1992 |title=Q-learning |journal=Machine Learning |volume=8 |issue=3–4 |pages=279–292 |doi=10.1007/BF00992698 |doi-access=free |hdl=21.11116/0000-0002-D738-D |hdl-access=free }}</ref>

Watkins was addressing "Learning from delayed rewards", the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name "delayed reinforcement learning", was solved by Bozinovski's Crossbar Adaptive Array (CAA).<ref name="DobnikarSteele1999">{{cite book|editor-last1=Dobnikar|editor-first1=Andrej|editor-last2=Steele|editor-first2=Nigel C.|editor-last3=Pearson|editor-first3=David W.|editor-first4=Rudolf F. |editor-last4=Albrecht|title=Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999|chapter-url={{google books |plainurl=y |id=clKwynlfZYkC|page=320-325}}|date=15 July 1999|publisher=Springer Science & Business Media|isbn=978-3-211-83364-3 |first=S. |last=Bozinovski |chapter=Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem|pages=320–325}}</ref><ref name="Trappl1982">{{cite book|editor-last=Trappl|editor-first=Robert|title=Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research|chapter-url={{google books |plainurl=y |id=mGtQAAAAMAAJ|page=397}}|year=1982|publisher=North Holland|isbn=978-0-444-86488-8|first=S. |last=Bozinovski |chapter=A self learning system using secondary reinforcement|pages=397–402}}</ref> Its memory matrix <math>W = \|w(a,s)\|</math> was the same as the Q-table introduced eight years later in Q-learning. The architecture introduced the term "state evaluation" in reinforcement learning. The crossbar learning algorithm, written in mathematical [[pseudocode]] in the paper, performs the following computation in each iteration:
* In state {{mvar|s}} perform action {{mvar|a}};
* Receive consequence state {{mvar|s'}};
* Compute state evaluation {{tmath|v(s')}};
* Update crossbar value <math>w'(a,s) = w(a,s) + v(s')</math>.

The term "secondary reinforcement" is borrowed from animal learning theory to model state values via [[backpropagation]]: the state value {{tmath|v(s')}} of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained desirable, undesirable, and neutral states, whose values were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm.<ref name="OmidvarElliott1997">{{cite book|editor-last1=Omidvar|editor-first1=Omid|editor-last2=Elliott|editor-first2=David L.|title=Neural Systems for Control|chapter-url={{google books |plainurl=y |id=oLcAiySCow0C}}|date=24 February 1997|publisher=Elsevier|isbn=978-0-08-053739-9|first=A. |last=Barto |chapter=Reinforcement learning}}</ref>

In 2014, [[Google DeepMind]] patented<ref>{{cite web|url=https://patentimages.storage.googleapis.com/71/91/4a/c5cf4ffa56f705/US20150100530A1.pdf|title=Methods and Apparatus for Reinforcement Learning, US Patent #20150100530A1|publisher=US Patent Office|date=9 April 2015|access-date=28 July 2018}}</ref> an application of Q-learning to [[deep learning]], titled "deep reinforcement learning" or "deep Q-learning", that can play [[Atari 2600]] games at expert human level.
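A minimal sketch in Python of one iteration of the crossbar update listed above, assuming a tabular memory matrix; the names (<code>state_value</code>, <code>env_step</code>) and the action-selection rule are illustrative assumptions, not taken from the cited papers:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sketch only: variable names and the argmax action selection
# are assumptions; the cited papers give the update in mathematical pseudocode.
n_states, n_actions = 5, 3
W = np.zeros((n_actions, n_states))   # crossbar memory matrix w(a, s)
state_value = np.zeros(n_states)      # v(s): e.g. +1 desirable, -1 undesirable, 0 neutral

def crossbar_step(s, env_step):
    """Perform one iteration of the crossbar update from state s."""
    a = int(np.argmax(W[:, s]))       # in state s perform action a (one possible selection rule)
    s_next = env_step(s, a)           # receive consequence state s'
    v_next = state_value[s_next]      # compute state evaluation v(s')
    W[a, s] += v_next                 # update crossbar value: w'(a,s) = w(a,s) + v(s')
    return s_next
</syntaxhighlight>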