Perceptron
{{Short description|Algorithm for supervised learning of binary classifiers}} {{Redirect|Perceptrons|the 1969 book|Perceptrons (book)}} {{Machine learning|Supervised learning}} In [[machine learning]], the '''perceptron''' is an algorithm for [[supervised classification|supervised learning]] of [[binary classification|binary classifiers]]. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.{{r|largemargin}} It is a type of [[linear classifier]], i.e. a classification algorithm that makes its predictions based on a [[linear predictor function]] combining a set of [[Weighting|weights]] with the [[feature vector]]. ==History== [[File:Mark I perceptron.jpeg|thumb|Mark I Perceptron machine, the first implementation of the perceptron algorithm. It was connected to a camera with 20×20 [[cadmium sulfide]] [[photocell]]s to make a 400-pixel image. The main visible feature is the sensory-to-association plugboard, which sets different combinations of input features. To the right are arrays of [[potentiometer]]s that implemented the adaptive weights.{{r|bishop}}{{rp|213}}|alt=]] {{see also|History of artificial intelligence#Perceptrons}} [[File:330-PSA-80-60 (USN 710739) (20897323365).jpg|thumb|The Mark 1 Perceptron, being adjusted by Charles Wightman (Mark I Perceptron project engineer).<ref>{{Cite book |last=Hecht-Nielsen |first=Robert |title=Neurocomputing |date=1991 |publisher=Addison-Wesley |isbn=978-0-201-09355-1 |edition=Reprint. with corrections |location=Reading (Mass.) Menlo Park (Calif.) New York [etc.] |at=p. 6, Figure 1.3 caption.}}</ref> Sensory units at left, association units in center, and control panel and response units at far right. The sensory-to-association plugboard is behind the closed panel to the right of the operator. The letter "C" on the front panel is a display of the current state of the sensory input.<ref>{{Cite journal |last=Block |first=H. D. 
|date=1962-01-01 |title=The Perceptron: A Model for Brain Functioning. I |url=https://link.aps.org/doi/10.1103/RevModPhys.34.123 |journal=Reviews of Modern Physics |language=en |volume=34 |issue=1 |pages=123–135 |doi=10.1103/RevModPhys.34.123 |bibcode=1962RvMP...34..123B |issn=0034-6861|url-access=subscription }}</ref>]] The artificial neuron network was invented in 1943 by [[Warren McCulloch]] and [[Walter Pitts]] in ''[[A Logical Calculus of the Ideas Immanent in Nervous Activity|A logical calculus of the ideas immanent in nervous activity]]''.<ref>{{cite journal |last1=McCulloch |first1=W |last2=Pitts |first2=W |title=A Logical Calculus of Ideas Immanent in Nervous Activity |journal=Bulletin of Mathematical Biophysics |date=1943 |volume=5 |issue=4 |pages=115–133 |doi=10.1007/BF02478259 |url=https://www.bibsonomy.org/bibtex/13e8e0d06f376f3eb95af89d5a2f15957/schaul|url-access=subscription }}</ref> In 1957, [[Frank Rosenblatt]] was at the [[Cornell Aeronautical Laboratory]]. He simulated the perceptron on an [[IBM 704]].<ref name=":5">{{cite journal |last=Rosenblatt |first=Frank |year=1957 |title=The Perceptron—a perceiving and recognizing automaton |url=https://bpb-us-e2.wpmucdn.com/websites.umass.edu/dist/a/27637/files/2016/03/rosenblatt-1957.pdf |journal=Report 85-460-1 |publisher=Cornell Aeronautical Laboratory}}</ref><ref>{{Cite journal |last=Rosenblatt |first=Frank |date=March 1960 |title=Perceptron Simulation Experiments |url=https://ieeexplore.ieee.org/document/4066017 |journal=Proceedings of the IRE |volume=48 |issue=3 |pages=301–309 |doi=10.1109/JRPROC.1960.287598 |issn=0096-8390|url-access=subscription }}</ref> Later, he obtained funding by the Information Systems Branch of the United States [[Office of Naval Research]] and the [[Rome Air Development Center]], to build a custom-made computer, the [[Mark I Perceptron]]. 
It was first publicly demonstrated on 23 June 1960.<ref name=":0" /> The machine was "part of a previously secret four-year NPIC [the US' [[National Photographic Interpretation Center]]] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".<ref name=":1">{{Cite journal |last=O’Connor |first=Jack |date=2022-06-21 |title=Undercover Algorithm: A Secret Chapter in the Early History of Artificial Intelligence and Satellite Imagery |url=https://www.tandfonline.com/doi/full/10.1080/08850607.2022.2073542 |journal=International Journal of Intelligence and CounterIntelligence |language=en |pages=1–15 |doi=10.1080/08850607.2022.2073542 |issn=0885-0607 |s2cid=249946000|url-access=subscription }}</ref> Rosenblatt described the details of the perceptron in a 1958 paper.<ref>{{Cite journal |last=Rosenblatt |first=F. |date=1958 |title=The perceptron: A probabilistic model for information storage and organization in the brain. |url=http://dx.doi.org/10.1037/h0042519 |journal=Psychological Review |volume=65 |issue=6 |pages=386–408 |doi=10.1037/h0042519 |pmid=13602029 |issn=1939-1471|url-access=subscription }}</ref> His organization of a perceptron is constructed of three kinds of cells ("units"): AI, AII, R, which stand for "[[Projection areas|projection]]", "association" and "response". He presented at the first international symposium on AI, ''Mechanisation of Thought Processes'', which took place in 1958 November.<ref>Frank Rosenblatt, ‘''Two Theorems of Statistical Separability in the Perceptron''’, Symposium on the Mechanization of Thought, National Physical Laboratory, Teddington, UK, November 1958, vol. 1, H. M. Stationery Office, London, 1959.</ref> Rosenblatt's project was funded under Contract Nonr-401(40) "Cognitive Systems Research Program", which lasted from 1959 to 1970,<ref>Rosenblatt, Frank, and CORNELL UNIV ITHACA NY. [https://apps.dtic.mil/sti/citations/trecms/AD0720416 ''Cognitive Systems Research Program''.] 
Technical report, Cornell University, 72, 1971.</ref> and Contract Nonr-2381(00) "Project PARA" ("PARA" means "Perceiving and Recognition Automata"), which lasted from 1957<ref name=":5" /> to 1963.<ref>Muerle, John Ludwig, and CORNELL AERONAUTICAL LAB INC BUFFALO NY. ''[https://apps.dtic.mil/sti/citations/tr/AD0633137 Project Para, Perceiving and Recognition Automata]''. Cornell Aeronautical Laboratory, Incorporated, 1963.</ref> In 1959, the Institute for Defense Analysis awarded his group a $10,000 contract. By September 1961, the ONR had awarded a further $153,000 worth of contracts, with $108,000 committed for 1962.<ref>{{Cite thesis |last=Penn |first=Jonathan |title=Inventing Intelligence: On the History of Complex Information Processing and Artificial Intelligence in the United States in the Mid-Twentieth Century |date=2021-01-11 |url=https://www.repository.cam.ac.uk/handle/1810/315976 |doi=10.17863/cam.63087 |language=en}}</ref> The ONR research manager, Marvin Denicoff, stated that ONR, instead of [[DARPA|ARPA]], funded the Perceptron project, because the project was unlikely to produce technological results in the near or medium term. Funding from ARPA ran to the order of millions of dollars, while funding from ONR was on the order of 10,000 dollars. Meanwhile, the head of [[Information Processing Techniques Office|IPTO]] at ARPA, [[J.C.R. Licklider]], was interested in 'self-organizing', 'adaptive' and other biologically-inspired methods in the 1950s; but by the mid-1960s he was openly critical of these, including the perceptron. Instead he strongly favored the logical AI approach of [[Herbert A. 
Simon|Simon]] and [[Allen Newell|Newell]].<ref>{{Cite journal |last=Guice |first=Jon |date=1998 |title=Controversy and the State: Lord ARPA and Intelligent Computing |url=https://www.jstor.org/stable/285752 |journal=Social Studies of Science |volume=28 |issue=1 |pages=103–138 |doi=10.1177/030631298028001004 |jstor=285752 |pmid=11619937 |issn=0306-3127|url-access=subscription }}</ref> === Mark I Perceptron machine === {{Main article|Mark I Perceptron}} [[File:Organization_of_a_biological_brain_and_a_perceptron.png|thumb|281x281px|Organization of a biological brain and a perceptron.]] The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the [[IBM 704]], it was subsequently implemented in custom-built hardware as the [[Mark I Perceptron]] with the project name "Project PARA",<ref name=":6" /> designed for [[image recognition]]. The machine is currently in [[National Museum of American History|Smithsonian National Museum of American History]].<ref>{{Cite web |title=Perceptron, Mark I |url=https://americanhistory.si.edu/collections/search/object/nmah_334414 |access-date=2023-10-30 |website=National Museum of American History |language=en}}</ref> The Mark I Perceptron had three layers. One version was implemented as follows: * An array of 400 [[photocell]]s arranged in a 20x20 grid, named "sensory units" (S-units), or "input retina". Each S-unit can connect to up to 40 A-units. * A hidden layer of 512 perceptrons, named "association units" (A-units). * An output layer of eight perceptrons, named "response units" (R-units). Rosenblatt called this three-layered perceptron network the ''alpha-perceptron'', to distinguish it from other perceptron models he experimented with.<ref name=":0">{{Cite book |last=Nilsson |first=Nils J. 
|url=https://www.cambridge.org/core/books/quest-for-artificial-intelligence/32C727961B24223BBB1B3511F44F343E |title=The Quest for Artificial Intelligence |date=2009 |publisher=Cambridge University Press |isbn=978-0-521-11639-8 |location=Cambridge |chapter=4.2.1. Perceptrons}}</ref> The S-units are connected to the A-units randomly (according to a table of random numbers) via a plugboard (see photo), to "eliminate any particular intentional bias in the perceptron". The connection weights are fixed, not learned. Rosenblatt was adamant about the random connections, as he believed the retina was randomly connected to the visual cortex, and he wanted his perceptron machine to resemble human visual perception.<ref>{{Cite book |url=https://direct.mit.edu/books/book/4886/Talking-NetsAn-Oral-History-of-Neural-Networks |title=Talking Nets: An Oral History of Neural Networks |date=2000 |publisher=The MIT Press |isbn=978-0-262-26715-1 |editor-last=Anderson |editor-first=James A. |language=en |doi=10.7551/mitpress/6626.003.0004 |editor-last2=Rosenfeld |editor-first2=Edward}}</ref> The A-units are connected to the R-units, with adjustable weights encoded in [[potentiometer]]s, and weight updates during learning were performed by electric motors.<ref name="bishop">{{cite book |last=Bishop |first=Christopher M. |title=Pattern Recognition and Machine Learning |publisher=Springer |year=2006 |isbn=0-387-31073-8}}</ref>{{rp|193}}The hardware details are in an operators' manual.<ref name=":6">{{Cite book |last=Hay |first=John Cameron |url=https://apps.dtic.mil/sti/tr/pdf/AD0236965.pdf |title=Mark I perceptron operators' manual (Project PARA) / |date=1960 |publisher=Cornell Aeronautical Laboratory |location=Buffalo |archive-url=https://web.archive.org/web/20231027213510/https://apps.dtic.mil/sti/tr/pdf/AD0236965.pdf |archive-date=2023-10-27 }}</ref> [[File:Mark I Perceptron, Figure 2 of operator's manual.png|thumb|Components of the Mark I Perceptron. 
From the operator's manual.<ref name=":6" />]] In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling [[Artificial intelligence|AI]] community; based on Rosenblatt's statements, ''[[The New York Times]]'' reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."<ref name="Olazaran">{{cite journal |last=Olazaran |first=Mikel |year=1996 |title=A Sociological Study of the Official History of the Perceptrons Controversy |journal=Social Studies of Science |volume=26 |issue=3 |pages=611–659 |doi=10.1177/030631296026003005 |jstor=285702 |s2cid=16786738}}</ref> The Photo Division of [[Central Intelligence Agency]], from 1960 to 1964, studied the use of Mark I Perceptron machine for recognizing militarily interesting silhouetted targets (such as planes and ships) in [[Aerial photography|aerial photos]].<ref>{{Cite web |title=Perception Concepts to Photo-Interpretation |url=https://www.cia.gov/readingroom/document/cia-rdp78b04770a002300030027-6 |access-date=2024-11-14 |website=www.cia.gov}}</ref><ref>{{Cite journal |last=Irwin |first=Julia A. |date=2024-09-11 |title=Artificial Worlds and Perceptronic Objects: The CIA's Mid-century Automatic Target Recognition |url=https://direct.mit.edu/grey/article/doi/10.1162/grey_a_00415/124337/Artificial-Worlds-and-Perceptronic-Objects-The-CIA |journal=Grey Room |language=en |issue=97 |pages=6–35 |doi=10.1162/grey_a_00415 |issn=1526-3819|url-access=subscription }}</ref> === ''Principles of Neurodynamics'' (1962) === Rosenblatt described his experiments with many variants of the Perceptron machine in a book ''Principles of Neurodynamics'' (1962). 
The book is a published version of the 1961 report.<ref>''[[iarchive:DTIC AD0256582/|Principles of neurodynamics: Perceptrons and the theory of brain mechanisms]]'', by Frank Rosenblatt, Report Number VG-1196-G-8, Cornell Aeronautical Laboratory, published on 15 March 1961. The work reported in this volume has been carried out under Contract Nonr-2381 (00) (Project PARA) at C.A.L. and Contract Nonr-401(40), at Cornell University.</ref> Among the variants are: * "cross-coupling" (connections between units within the same layer) with possibly closed loops, * "back-coupling" (connections from units in a later layer to units in a previous layer), * four-layer perceptrons where the last two layers have adjustable weights (and thus a proper multilayer perceptron), * incorporating time-delays to perceptron units, to allow for processing sequential data, * analyzing audio (instead of images). The machine was shipped from Cornell to the Smithsonian in 1967, under a government transfer administered by the Office of Naval Research.<ref name=":1" /> === ''Perceptrons'' (1969) === {{Main|Perceptrons (book)}} Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of [[neural network (machine learning)|neural network]] research to stagnate for many years, before it was recognised that a [[feedforward neural network]] with two or more layers (also called a [[multilayer perceptron]]) had greater processing power than perceptrons with one layer (also called a [[Feedforward neural network#A threshold (e.g. activation function) added|single-layer perceptron]]). 
Single-layer perceptrons are only capable of learning [[linearly separable]] patterns.<ref name="Sejnowski">{{Cite book |last=Sejnowski |first=Terrence J.|author-link=Terry Sejnowski|url=https://books.google.com/books?id=9xZxDwAAQBAJ |title=The Deep Learning Revolution |date=2018|publisher=MIT Press |isbn=978-0-262-03803-4 |language=en|page=47}}</ref> For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, is sufficient to solve many otherwise non-separable problems. In 1969, a famous book entitled ''[[Perceptrons (book)|Perceptrons]]'' by [[Marvin Minsky]] and [[Seymour Papert]] showed that it was impossible for these classes of network to learn an [[XOR]] function. It is often incorrectly believed that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on ''[[Perceptrons (book)]]'' for more information.) Nevertheless, the often-miscited Minsky and Papert text caused a significant decline in interest in and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s.<ref name="Sejnowski"/>{{Verify source|date=October 2024|reason=Does the source support all of the preceding text and is "often incorrectly believed" true today or was it only true in the past?}} This text was reprinted in 1987 as "Perceptrons - Expanded Edition", in which some errors in the original text are identified and corrected. === Subsequent work === Rosenblatt continued working on perceptrons despite diminishing funding. 
The last attempt was Tobermory, built between 1961 and 1967 for speech recognition.<ref>Rosenblatt, Frank (1962). “''[https://web.archive.org/web/20231230210135/https://apps.dtic.mil/sti/tr/pdf/AD0420696.pdf#page=163 A Description of the Tobermory Perceptron]''.” Cognitive Research Program. Report No. 4. Collected Technical Papers, Vol. 2. Edited by Frank Rosenblatt. Ithaca, NY: Cornell University.</ref> It occupied an entire room.<ref name=":7">Nagy, George. 1963. ''[https://web.archive.org/web/20231230204827/https://apps.dtic.mil/sti/trecms/pdf/AD0607459.pdf System and circuit designs for the Tobermory perceptron]''. Technical report number 5, Cognitive Systems Research Program, Cornell University, Ithaca New York.</ref> It had 4 layers with 12,000 weights implemented by toroidal [[magnetic core]]s. By the time of its completion, simulation on digital computers had become faster than purpose-built perceptron machines.<ref>Nagy, George. "Neural networks-then and now." ''IEEE Transactions on Neural Networks'' 2.2 (1991): 316-318.</ref> Rosenblatt died in a boating accident in 1971. [[File:Isometric view of Tobermory Phase I.png|thumb|Isometric view of Tobermory Phase I.<ref name=":7" />]] The [[kernel perceptron]] algorithm was already introduced in 1964 by Aizerman et al.<ref>{{cite journal |last1=Aizerman |first1=M. A. |last2=Braverman |first2=E. M. |last3=Rozonoer |first3=L. I. |year=1964 |title=Theoretical foundations of the potential function method in pattern recognition learning |journal=Automation and Remote Control |volume=25 |pages=821–837 }}</ref> Margin bound guarantees were given for the Perceptron algorithm in the general non-separable case first by [[Yoav Freund|Freund]] and [[Robert Schapire|Schapire]] (1998),<ref name="largemargin">{{Cite journal |doi=10.1023/A:1007662407062 |year=1999 |title=Large margin classification using the perceptron algorithm |last1=Freund |first1=Y. 
|author-link1=Yoav Freund |journal=[[Machine Learning (journal)|Machine Learning]] |volume=37 |issue=3 |pages=277–296 |last2=Schapire |first2=R. E. |s2cid=5885617 |author-link2=Robert Schapire |url=http://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf|doi-access=free }}</ref> and more recently by [[Mehryar Mohri|Mohri]] and Rostamizadeh (2013), who extend previous results and give new and more favorable L1 bounds.<ref>{{cite arXiv |last1=Mohri |first1=Mehryar |last2=Rostamizadeh |first2=Afshin |title=Perceptron Mistake Bounds |eprint=1305.0208 |year=2013 |class=cs.LG }}</ref><ref>[https://mitpress.mit.edu/books/foundations-machine-learning-second-edition] Foundations of Machine Learning, MIT Press (Chapter 8).</ref> The perceptron is a simplified model of a biological [[neuron]]. While the complexity of [[biological neuron model]]s is often required to fully understand neural behavior, research suggests a perceptron-like linear model can produce some behavior seen in real neurons.<ref>{{cite journal |last1=Cash |first1=Sydney |first2=Rafael |last2=Yuste |title=Linear Summation of Excitatory Inputs by CA1 Pyramidal Neurons |journal=[[Neuron (journal)|Neuron]] |volume=22 |issue=2 |year=1999 |pages=383–394 |doi=10.1016/S0896-6273(00)81098-3 |pmid=10069343 |doi-access=free }}</ref> The solution spaces of decision boundaries for all binary functions, and the associated learning behaviors, have been studied by Liou et al.<ref>{{cite book |last1=Liou |first1=D.-R. |title=Learning Behaviors of Perceptron |last2=Liou |first2=J.-W. |last3=Liou |first3=C.-Y. 
|publisher=iConcept Press |year=2013 |isbn=978-1-477554-73-9}}</ref> == Definition == [[Image:Perceptron.svg|right|thumb|353x353px|The appropriate weights are applied to the inputs, and the resulting weighted sum passed to a function that produces the output o.]]In the modern sense, the perceptron is an algorithm for learning a binary classifier called a [[Linear classifier#Definition|threshold function]]: a function that maps its input <math>\mathbf{x}</math> (a real-valued [[Vector space|vector]]) to an output value <math>f(\mathbf{x})</math> (a single [[Binary function|binary]] value): <math display="block"> f(\mathbf{x}) = h(\mathbf{w} \cdot \mathbf{x} + b) </math> where <math>h</math> is the [[Heaviside step function|Heaviside step-function]] (where an input of <math display="inline"> > 0</math> outputs 1; otherwise 0 is the output ), <math>\mathbf{w}</math> is a vector of real-valued weights, <math>\mathbf{w} \cdot \mathbf{x}</math> is the [[dot product]] <math display="inline">\sum_{i=1}^m w_i x_i</math>, where {{mvar|m}} is the number of inputs to the perceptron, and {{mvar|b}} is the ''bias''. The bias shifts the decision boundary away from the origin and does not depend on any input value. Equivalently, since <math>\mathbf{w}\cdot \mathbf{x} + b = (\mathbf{w}, b) \cdot (\mathbf{x}, 1)</math>, we can add the bias term <math>b</math> as another weight <math>\mathbf{w}_{m+1}</math> and add a coordinate <math>1</math> to each input <math>\mathbf{x}</math>, and then write it as a linear classifier that passes the origin:<math display="block"> f(\mathbf{x}) = h(\mathbf{w} \cdot \mathbf{x}) </math> The binary value of <math>f(\mathbf{x})</math> (0 or 1) is used to perform binary classification on <math>\mathbf{x}</math> as either a positive or a negative instance. Spatially, the bias shifts the position (though not the orientation) of the planar [[decision boundary]]. 
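The threshold function defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example weights (chosen here so that the unit computes logical AND) are assumptions, not part of the definition.

```python
from typing import List, Sequence


def heaviside(z: float) -> int:
    """Step function h: returns 1 for z > 0 and 0 otherwise."""
    return 1 if z > 0 else 0


def perceptron_output(w: Sequence[float], x: Sequence[float], b: float = 0.0) -> int:
    """Compute f(x) = h(w . x + b) for weight vector w and bias b."""
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)


def augmented_output(w_aug: Sequence[float], x: Sequence[float]) -> int:
    """Same classifier with the bias folded in as a last weight.

    The input is extended with a constant coordinate 1, so the
    decision boundary passes through the origin of the larger space.
    """
    return perceptron_output(w_aug, list(x) + [1.0])


# Illustrative (assumed) weights: this unit computes logical AND.
w: List[float] = [1.0, 1.0]
b: float = -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_output(w, x, b) for x in inputs]
```

Calling `augmented_output(w + [b], x)` gives the same result as `perceptron_output(w, x, b)` for every input, which is the equivalence between the biased and the bias-free form stated above.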
In the context of neural networks, a perceptron is an [[artificial neuron]] using the [[Heaviside step function]] as the activation function. The perceptron algorithm is also termed the '''single-layer perceptron''', to distinguish it from a [[multilayer perceptron]], which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest [[feedforward neural network]]. == Power of representation == ===Information theory=== From an [[information theory]] point of view, a single perceptron with ''K'' inputs has a capacity of ''2K'' [[bit]]s of information.<ref name=":2">{{cite book |last=MacKay |first=David |url=https://books.google.com/books?id=AKuMj4PN_EMC&pg=PA483 |title=Information Theory, Inference and Learning Algorithms |date=2003-09-25 |publisher=[[Cambridge University Press]] |isbn=9780521642989 |page=483 |author-link=David J. C. MacKay}}</ref> This result is due to [[Thomas M. Cover|Thomas Cover]].<ref>{{Cite journal |last=Cover |first=Thomas M. |date=June 1965 |title=Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition |url=https://ieeexplore.ieee.org/document/4038449 |journal=IEEE Transactions on Electronic Computers |volume=EC-14 |issue=3 |pages=326–334 |doi=10.1109/PGEC.1965.264137 |issn=0367-7508|url-access=subscription }}</ref> Specifically let <math>T(N, K)</math> be the number of ways to linearly separate ''N'' points in ''K'' dimensions, then<math display="block">T(N, K)=\left\{\begin{array}{cc} 2^N & K \geq N \\ 2 \sum_{k=0}^{K-1}\left(\begin{array}{c} N-1 \\ k \end{array}\right) & K<N \end{array}\right.</math>When ''K'' is large, <math>T(N, K)/2^N</math> is very close to one when <math>N \leq 2K</math>, but very close to zero when <math>N> 2K</math>. 
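The counting function <math>T(N, K)</math> can be evaluated directly to see the sharp transition around <math>N = 2K</math>. The following is a minimal sketch; the function names are chosen for illustration only.

```python
from math import comb


def cover_count(n_points: int, k_dims: int) -> int:
    """Evaluate T(N, K) from the piecewise formula above."""
    if k_dims >= n_points:
        return 2 ** n_points
    return 2 * sum(comb(n_points - 1, k) for k in range(k_dims))


def separable_fraction(n_points: int, k_dims: int) -> float:
    """Fraction T(N, K) / 2^N of labelings that are linearly separable."""
    return cover_count(n_points, k_dims) / 2 ** n_points


# With K = 50 inputs, compare N below, at, and above 2K = 100.
below = separable_fraction(80, 50)
at_capacity = separable_fraction(100, 50)
above = separable_fraction(120, 50)
```

By the symmetry of the binomial coefficients, the fraction is exactly one half at <math>N = 2K</math>; it approaches one below that point and zero above it as ''K'' grows.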
In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on ''N'' points when <math>N \leq 2K</math>, but almost certainly not when <math>N> 2K</math>. === Boolean function === When operating on only binary inputs, a perceptron is called a [[Linear separability#Linear separability of Boolean functions in n variables|linearly separable Boolean function]], or threshold Boolean function. The sequence of numbers of threshold Boolean functions on ''n'' inputs is [[On-Line Encyclopedia of Integer Sequences|OEIS]] [[oeis:A000609|A000609]]. The value is only known exactly up to the <math>n=9</math> case, but the order of magnitude is known quite exactly: it has upper bound <math>2^{n^2 - n \log_2 n + O(n)}</math> and lower bound <math>2^{n^2 - n \log_2 n - O(n)}</math>.<ref name=":4">{{Cite journal |last1=Šíma |first1=Jiří |last2=Orponen |first2=Pekka |date=2003-12-01 |title=General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results |url=https://direct.mit.edu/neco/article/15/12/2727-2778/6791 |journal=Neural Computation |language=en |volume=15 |issue=12 |pages=2727–2778 |doi=10.1162/089976603322518731 |pmid=14629867 |issn=0899-7667|url-access=subscription }}</ref> Any Boolean linear threshold function can be implemented with only integer weights. Furthermore, the number of bits necessary and sufficient for representing a single integer weight parameter is <math>\Theta(n \ln n)</math>.<ref name=":4" /> ===Universal approximation theorem=== {{main|Universal approximation theorem}} A single perceptron can learn to classify any half-space. It cannot correctly classify sets of vectors that are not linearly separable, such as those of the Boolean [[exclusive-or]] problem (the famous "XOR problem"). A perceptron network with '''one hidden layer''' can learn to classify any compact subset arbitrarily closely. Similarly, it can also approximate any [[Compactly supported|compactly-supported]] [[continuous function]] arbitrarily closely. 
This is essentially a special case of the [[Universal approximation theorem#Arbitrary-width case|theorems by George Cybenko and Kurt Hornik]]. === Conjunctively local perceptron === {{Main|Perceptrons (book)}} ''Perceptrons'' (Minsky and Papert, 1969) studied the kind of perceptron networks necessary to learn various Boolean functions. Consider a perceptron network with <math>n</math> input units, one hidden layer, and one output, similar to the Mark I Perceptron machine. It computes a Boolean function of type <math>f: 2^n \to 2 </math>. They call a function '''conjunctively local of order <math>k</math>''', iff there exists a perceptron network such that each unit in the hidden layer connects to at most <math>k</math> input units. Theorem. (Theorem 3.1.1): The parity function is conjunctively local of order <math>n</math>. Theorem. (Section 5.5): The connectedness function is conjunctively local of order <math>\Omega(n^{1/2})</math>. == Learning algorithm for a single-layer perceptron == [[File:Perceptron_example.svg|right|thumb|310x310px|A diagram showing a perceptron updating its linear boundary as more training examples are added]] Below is an example of a learning algorithm for a single-layer perceptron with a single output unit. For a single-layer perceptron with multiple output units, since the weights of one output unit are completely separate from all the others', the same algorithm can be run for each output unit. For [[multilayer perceptron]]s, where a hidden layer exists, more sophisticated algorithms such as [[backpropagation]] must be used. If the activation function or the underlying process being modeled by the perceptron is [[Nonlinear system|nonlinear]], alternative learning algorithms such as the [[delta rule]] can be used as long as the activation function is [[Differentiable function|differentiable]]. 
Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions. When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation. === Definitions === We first define some variables: *<math>r</math> is the [[learning rate]] of the perceptron. Learning rate is a positive number usually chosen to be less than 1. The larger the value, the greater the chance for volatility in the weight changes. *<math>y = f(\mathbf{z}) </math> denotes the ''output'' from the perceptron for an input vector <math>\mathbf{z}</math>. *<math>D = \{(\mathbf{x}_1,d_1),\dots,(\mathbf{x}_s,d_s)\} </math> is the ''training set'' of <math>s</math> samples, where: ** <math>\mathbf{x}_j</math> is the <math>n</math>-dimensional input vector. ** <math>d_j </math> is the desired output value of the perceptron for that input. We show the values of the features as follows: *<math>x_{j,i} </math> is the value of the <math>i</math>th feature of the <math>j</math>th training ''input vector''. *<math>x_{j,0} = 1 </math>. To represent the weights: *<math>w_i </math> is the <math>i</math>th value in the ''weight vector'', to be multiplied by the value of the <math>i</math>th input feature. *Because <math>x_{j,0} = 1 </math>, the <math>w_0 </math> is effectively a bias that we use instead of the bias constant <math>b</math>. To show the time-dependence of <math>\mathbf{w}</math>, we use: *<math>w_i(t) </math> is the weight <math>i</math> at time <math>t</math>. === Steps=== {{Ordered list | Initialize the weights. Weights may be initialized to 0 or to a small random value. In the example below, we use 0. 
| For each example {{mvar|j}} in our training set {{mvar|D}}, perform the following steps over the input <math>\mathbf{x}_j </math> and desired output <math>d_j </math>: {{Ordered list |list_style_type=lower-alpha |Calculate the actual output: :<math>\begin{align} y_j(t) &= f[\mathbf{w}(t)\cdot\mathbf{x}_j] \\ &= f[w_0(t)x_{j,0} + w_1(t)x_{j,1} + w_2(t)x_{j,2} + \dotsb + w_n(t)x_{j,n}] \end{align}</math> |Update the weights: :<math>w_i(t+1) = w_i(t) \; \boldsymbol{+} \; r\cdot(d_j - y_j(t)) x_{j,i} </math>, for all features <math>0 \leq i \leq n</math>, <math>r</math> is the [[learning rate]]. }} | For [[offline learning]], the second step may be repeated until the iteration error <math>\frac{1}{s} \sum_{j=1}^s |d_j - y_j(t)| </math> is less than a user-specified error threshold <math>\gamma </math>, or a predetermined number of iterations have been completed, where ''s'' is again the size of the sample set. }} The algorithm updates the weights after every training sample in step 2b. ===Convergence of one perceptron on a linearly separable dataset=== [[File:Perceptron convergence theorem.svg|thumb|284x284px|Illustration of the perceptron convergence. In the picture, <math>\gamma = 0.01, R = 1, r = 1 </math>. All data points have <math>y = +1</math>, since the negative samples are equivalent to <math>y = +1</math> after reflection through the origin. As the learning proceeds, the weight vector performs a somewhat random walk in the space of weights. Each step is at least 90 degrees away from its current direction, thus increasing its norm-square by at most <math>R</math>. Each step adds to <math>w</math> by a point in the samples, and since all the samples have <math>x_1 \geq 0.01</math>, the weight vector must move along <math>x_1</math> by at least <math>0.01</math>. 
Since the norm grows like <math>\sqrt t</math> but the <math>x_1</math>-component grows like <math>t</math>, this would eventually force the weight vector to point almost entirely in the <math>x_1</math> direction, and thus achieve convergence.]] A single perceptron is a [[linear classifier]]. It can only reach a stable state if all input vectors are classified correctly. In case the training set {{mvar|D}} is ''not'' [[linearly separable]], i.e. if the positive examples cannot be separated from the negative examples by a hyperplane, then the algorithm would not converge since there is no solution. Hence, if linear separability of the training set is not known a priori, one of the training variants below should be used. Detailed analysis and extensions to the convergence theorem are in Chapter 11 of ''Perceptrons'' (1969). Linear separability is testable in time <math>\min(O(n^{d/2}), O(d^{2n}), O(n^{d-1} \ln n)) </math>, where <math>n</math> is the number of data points, and <math>d</math> is the dimension of each point.<ref>{{Cite web |title=Introduction to Machine Learning, Chapter 3: Perceptron |url=https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/courseware/Week2/perceptron/?activate_block_id=block-v1:MITx+6.036+1T2019+type@sequential+block@perceptron |access-date=2023-10-27 |website=openlearninglibrary.mit.edu |language=en}}</ref> If the training set ''is'' linearly separable, then the perceptron is guaranteed to converge after making finitely many mistakes.<ref>{{Cite journal|last=Novikoff|first=Albert J.|date=1963|title=On convergence proofs for perceptrons|journal=Office of Naval Research}}</ref> The theorem is proved by Rosenblatt et al. 
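The update rule from the Steps section, and the fact that training stops after finitely many mistakes on separable data, can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the function names, the `max_epochs` cap and the choice of the logical AND data set are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Threshold output for input x, with w[0] acting as the bias weight."""
    x_aug = [1.0] + list(x)  # x_{j,0} = 1, as in the Definitions section
    return 1 if sum(wi * xi for wi, xi in zip(w, x_aug)) > 0 else 0


def train_perceptron(samples: List[Sample], r: float = 1.0,
                     max_epochs: int = 100) -> Tuple[List[float], int]:
    """Run the perceptron rule; return the weights and the mistake count."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # step 1: initialize all weights to 0
    mistakes = 0
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            y = predict(w, x)  # step 2a: actual output
            if y != d:
                x_aug = [1.0] + list(x)
                # step 2b: w_i <- w_i + r * (d - y) * x_i
                w = [wi + r * (d - y) * xi for wi, xi in zip(w, x_aug)]
                errors += 1
        mistakes += errors
        if errors == 0:  # a full pass without mistakes: stable state
            break
    return w, mistakes


# Logical AND is linearly separable, so training reaches a stable state.
AND_SAMPLES: List[Sample] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, mistakes = train_perceptron(AND_SAMPLES)
```

When the prediction is already correct, <math>d_j - y_j(t) = 0</math> and the rule leaves the weights unchanged, so the sketch only applies the update on mistakes. On data that is not linearly separable, the loop would simply run until `max_epochs` is exhausted.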
{{Math theorem|name=Perceptron convergence theorem|note=|math_statement= Given a dataset <math display="inline">D</math> such that <math display="inline">\max_{(x,y) \in D}\|x\|_2 = R</math>, and which is linearly separable by some unit vector <math display="inline">w^*</math> with margin <math display="inline">\gamma</math>: <math display="block">\gamma := \min_{(x,y) \in D} y(w^*\cdot x )</math> Then the perceptron 0-1 learning algorithm converges after making at most <math display="inline">(R/\gamma)^2</math> mistakes, for any learning rate, and any method of sampling from the dataset. }}The following simple proof is due to Novikoff (1962). The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction with which it has a non-positive [[dot product]], and thus its norm can be bounded above by {{math|''O''({{sqrt|''t''}})}}, where {{mvar|t}} is the number of changes to the weight vector. However, its norm is also bounded below by a quantity that grows linearly in {{mvar|t}}, because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector.{{Math proof|title=Proof|proof= Suppose that at step <math display="inline">t</math>, the perceptron with weight <math display="inline">w_t</math> makes a mistake on data point <math display="inline">(x, y)</math>; it then updates to <math display="inline">w_{t+1} = w_t + r(y-f_{w_t}(x) ) x</math>. The case <math display="inline">y = 0</math> is symmetric, so we omit it. [[WLOG]], take <math display="inline">y = 1</math>; then <math display="inline">f_{w_t}(x) = 0</math>, <math display="inline">f_{w^*}(x) = 1</math>, and <math display="inline">w_{t+1} = w_t + rx</math>. 
By assumption, we have separation with margin <math display="inline">\gamma</math>: <math display="block">w^* \cdot x \geq \gamma</math> Thus,<br /> <math display="block">w^* \cdot w_{t+1} - w^* \cdot w_{t} = w^* \cdot (rx) \geq r\gamma</math> Also <math display="block">\|w_{t+1}\|_2^2 - \|w_{t}\|_2^2 = \|w_{t} + rx\|_2^2 - \|w_{t}\|_2^2 = 2r (w_t \cdot x) + r^2 \|x\|_2^2</math> and since the perceptron made a mistake, <math display="inline">w_t \cdot x \leq 0</math>, and so<br /> <math display="block">\|w_{t+1}\|_2^2 - \|w_{t}\|_2^2 \leq r^2\|x\|_2^2 \leq r^2R^2</math> Since we started with <math display="inline">w_0 = 0</math>, after making <math display="inline">N</math> mistakes, <math display="block">\|w\|_2 \leq \sqrt{Nr^2R^2}</math> but also<br /> <math display="block">\|w\|_2 \geq w \cdot w^* \geq Nr\gamma</math> Combining the two gives <math display="inline">Nr\gamma \leq \sqrt{N}\,rR</math>, and hence <math display="inline">N \leq (R/\gamma)^2</math> }} [[File:Perceptron cant choose.svg|thumb|300px|Two classes of points, and two of the infinitely many linear boundaries that separate them. Even though the boundaries are at nearly right angles to one another, the perceptron algorithm has no way of choosing between them.]] While the perceptron algorithm is guaranteed to converge on ''some'' solution in the case of a linearly separable training set, it may still pick ''any'' solution and problems may admit many solutions of varying quality.<ref>{{cite book |last=Bishop |first=Christopher M |title=Pattern Recognition and Machine Learning |publisher=Springer Science+Business Media, LLC |isbn=978-0387-31073-2 |chapter=Chapter 4. Linear Models for Classification |pages=194|date=2006-08-17 }}</ref> The ''perceptron of optimal stability'', nowadays better known as the linear [[support-vector machine]], was designed to solve this problem (Krauth and [[Marc Mézard|Mezard]], 1987).<ref name="KrauthMezard87">{{cite journal |first1=W. |last1=Krauth |first2=M. 
|last2=Mezard |title=Learning algorithms with optimal stability in neural networks |journal=Journal of Physics A: Mathematical and General |volume=20 |issue= 11|pages=L745–L752 |year=1987 |doi=10.1088/0305-4470/20/11/013 |bibcode=1987JPhA...20L.745K }}</ref> === Perceptron cycling theorem === When the dataset is not linearly separable, a single perceptron cannot converge. However, the weights still remain bounded:<ref>{{Cite journal |last1=Block |first1=H. D. |last2=Levin |first2=S. A. |date=1970 |title=On the boundedness of an iterative procedure for solving a system of linear inequalities |url=https://www.ams.org/proc/1970-026-02/S0002-9939-1970-0265383-5/ |journal=Proceedings of the American Mathematical Society |language=en |volume=26 |issue=2 |pages=229–235 |doi=10.1090/S0002-9939-1970-0265383-5 |issn=0002-9939|doi-access=free }}</ref> {{Math theorem | name = Perceptron cycling theorem | note = | math_statement = If the dataset <math>D</math> has only finitely many points, then there exists an upper bound <math>M</math>, such that for any starting weight vector <math>w_0</math> every weight vector <math>w_t</math> has norm bounded by <math>\|w_t\| \leq \|w_0\|+M</math> }}This was first proved by [[Bradley Efron]].<ref>Efron, Bradley. "The perceptron correction procedure in nonseparable situations." ''Rome Air Dev. Center Tech. Doc. Rept'' (1964).</ref> === Learning a Boolean function === Consider a dataset where the <math>x</math> are from <math>\{-1, +1\}^n</math>, that is, the vertices of an n-dimensional hypercube centered at the origin, and <math>y = \theta(x_i)</math> for some fixed coordinate <math>i</math>. That is, all data points with positive <math>x_i</math> have <math>y=1</math>, and vice versa. By the perceptron convergence theorem, a perceptron would converge after making at most <math>n</math> mistakes. 
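The mistake bound above can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a perceptron without a bias term, zero initial weights, a strict threshold at zero, and learning rate 1. The names <code>train_perceptron</code>, <code>heaviside</code> and <code>target</code> are illustrative and do not come from any cited source. Here the data satisfy <math>R = \sqrt{n}</math>, and the unit vector along coordinate <math>i</math> separates them with margin <math>\gamma = 1</math>, so the bound <math>(R/\gamma)^2</math> equals <math>n</math>.

```python
import itertools
import random


def heaviside(z):
    """Step activation: 1 for strictly positive input, else 0."""
    return 1 if z > 0 else 0


def train_perceptron(samples, n, r=1.0, max_epochs=1000):
    """Run the perceptron update rule until a full pass makes no mistake.

    Returns the final weights and the total number of mistakes made.
    """
    w = [0.0] * n  # zero initial weights, as in the convergence proof
    mistakes = 0
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:
            y = heaviside(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                mistakes += 1
                changed = True
                # w_i <- w_i + r * (d - y) * x_i
                w = [wi + r * (d - y) * xi for wi, xi in zip(w, x)]
        if not changed:
            break
    return w, mistakes


n = 6        # dimension of the hypercube (illustrative choice)
target = 2   # index i of the coordinate that determines the label
samples = [(x, 1 if x[target] > 0 else 0)
           for x in itertools.product((-1, 1), repeat=n)]
random.Random(0).shuffle(samples)  # any presentation order is allowed

w, mistakes = train_perceptron(samples, n)
print(f"mistakes = {mistakes}, bound = {n}")
```

Changing the shuffle seed changes the presentation order, and therefore the number of mistakes, but the count should never exceed <math>n</math>.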
If we were to write a logical program to perform the same task, each positive example shows that one of the coordinates is the right one, and each negative example shows that its ''complement'' is a positive example. By collecting all the known positive examples, we eventually eliminate all but one coordinate, at which point the dataset is learned.<ref name=":3">{{Cite book |last1=Simon |first1=Herbert A. |title=The Sciences of the Artificial, reissue of the third edition with a new introduction by John Laird |last2=Laird |first2=John E. |date=2019-08-13 |publisher=The MIT Press |isbn=978-0-262-53753-7 |edition=Reissue |location=Cambridge, Massachusetts London, England |language=English |chapter=Limits on Speed of Concept Attainment}}</ref> This bound is asymptotically tight in the worst case. In the worst case, the first presented example is entirely new, and gives <math>n</math> bits of information, but each subsequent example would differ minimally from previous examples, and gives 1 bit each. After <math>n+1</math> examples, there are <math>2n</math> bits of information, which is sufficient for the perceptron (with <math>2n</math> bits of information).<ref name=":2" /> However, it is not tight in terms of expectation if the examples are presented uniformly at random, since the first would give <math>n</math> bits, the second <math>n/2</math> bits, and so on, taking <math>O(\ln n)</math> examples in total.<ref name=":3" /> == Variants == The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. 
However, these solutions appear purely stochastically and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps. The Maxover algorithm (Wendemuth, 1995) is [[Robustness (computer science)|"robust"]] in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set.<ref>{{cite journal |first=A. |last=Wendemuth |title=Learning the Unlearnable |journal=Journal of Physics A: Mathematical and General |volume=28 |issue= 18|pages=5423–5436 |year=1995 |doi=10.1088/0305-4470/28/18/030 |bibcode=1995JPhA...28.5423W }}</ref> In the linearly separable case, it will solve the training problem – if desired, even with optimal stability ([[Hyperplane separation theorem|maximum margin]] between the classes). For non-separable data sets, it will return a solution with a computable small number of misclassifications.<ref>{{cite journal |first=A. |last=Wendemuth |title=Performance of robust training algorithms for neural networks |journal=Journal of Physics A: Mathematical and General |volume=28 |issue= 19|pages=5485–5493 |year=1995 |doi=10.1088/0305-4470/28/19/006 |bibcode=1995JPhA...28.5485W }}</ref> In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets. The Voted Perceptron (Freund and Schapire, 1999) is a variant using multiple weighted perceptrons. The algorithm starts a new perceptron every time an example is wrongly classified, initializing the weight vector with the final weights of the last perceptron. Each perceptron is also given another weight corresponding to how many examples it correctly classifies before wrongly classifying one, and at the end the output is a weighted vote over all perceptrons. 
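The voted perceptron described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' exact procedure: labels are taken from <math>\{-1, +1\}</math>, there is no bias term, the learning rate is 1, and the names <code>train_voted</code> and <code>predict_voted</code> as well as the four-point dataset are hypothetical.

```python
def sign(z):
    """Sign with ties broken toward -1 (an arbitrary choice for this sketch)."""
    return 1 if z > 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_voted(samples, epochs=2):
    """Return a list of (weight_vector, survival_count) pairs."""
    v = [0.0] * len(samples[0][0])
    c = 0                      # examples survived by the current perceptron
    perceptrons = []
    for _ in range(epochs):
        for x, y in samples:
            if y * dot(v, x) <= 0:
                # Mistake: retire the current perceptron with its count and
                # start a new one from the retired weights plus the update.
                perceptrons.append((v, c))
                v = [vi + y * xi for vi, xi in zip(v, x)]
                c = 1
            else:
                c += 1
    perceptrons.append((v, c))  # keep the last perceptron as well
    return perceptrons


def predict_voted(perceptrons, x):
    """Weighted vote: each perceptron votes sign(v.x) with weight c."""
    return sign(sum(c * sign(dot(v, x)) for v, c in perceptrons))


# Hypothetical toy data, linearly separable through the origin.
samples = [((1, 2), 1), ((-1, 1), -1), ((1, -2), 1), ((-1, -1), -1)]
perceptrons = train_voted(samples, epochs=2)
predictions = [predict_voted(perceptrons, x) for x, _ in samples]
```

Storing every intermediate weight vector costs memory proportional to the number of mistakes; that is the price of the weighted vote compared with returning only the last weight vector.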
In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987)<ref name="KrauthMezard87" /> or the AdaTron (Anlauf and Biehl, 1989).<ref>{{cite journal |first1=J. K. |last1=Anlauf |first2=M. |last2=Biehl |title=The AdaTron: an Adaptive Perceptron algorithm |journal=Europhysics Letters |volume=10 |issue= 7|pages=687–692 |year=1989 |doi=10.1209/0295-5075/10/7/014 |bibcode=1989EL.....10..687A |s2cid=250773895 }}</ref> AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the [[kernel trick]], are the conceptual foundations of the [[support-vector machine]]. The <math>\alpha</math>-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify [[:wiktionary:analogue|analogue]] patterns, by projecting them into a [[Binary Space Partition|binary space]]. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable. Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an ''n''-order network. It should be kept in mind, however, that the best classifier is not necessarily that which classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is [[overfitting|overfitted]]. 
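The second-order expansion mentioned above can be illustrated on the XOR function, which is not linearly separable in its two raw inputs. The sketch below is only an illustration under stated assumptions: inputs and labels are taken from <math>\{0, 1\}</math>, a constant feature 1 is prepended to act as a bias, and the helper names <code>expand_second_order</code>, <code>predict</code> and <code>train</code> are hypothetical. After the expansion, XOR is separable (for example by <math>x_1 + x_2 - 2x_1x_2 > 0.5</math>), so the ordinary perceptron update converges.

```python
import itertools


def expand_second_order(x):
    """Map x to (1, x_1, ..., x_n, x_i * x_j for all pairs i < j)."""
    pairs = [a * b for a, b in itertools.combinations(x, 2)]
    return [1.0, *x, *pairs]


def predict(w, features):
    """Threshold unit: 1 if the weighted sum is strictly positive, else 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, features)) > 0 else 0


def train(samples, r=1.0, max_epochs=1000):
    """Plain perceptron training on already-expanded feature vectors."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for features, d in samples:
            y = predict(w, features)
            if y != d:
                errors += 1
                w = [wi + r * (d - y) * fi for wi, fi in zip(w, features)]
        if errors == 0:   # a full pass without mistakes: converged
            break
    return w


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
samples = [(expand_second_order(x), d) for x, d in xor_data]
w = train(samples)
predictions = [predict(w, f) for f, _ in samples]
print(predictions)
```

The number of pairwise features grows quadratically with the input dimension, so this expansion is practical mainly for low-dimensional inputs.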
Other linear classification algorithms include [[Winnow (algorithm)|Winnow]], [[support-vector machine]], and [[logistic regression]]. === Multiclass perceptron === Like most other techniques for training linear classifiers, the perceptron generalizes naturally to [[multiclass classification]]. Here, the input <math>x</math> and the output <math>y</math> are drawn from arbitrary sets. A feature representation function <math>f(x,y)</math> maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector <math>w</math>, but now the resulting score is used to choose among many possible outputs: :<math>\hat y = \operatorname{argmax}_y f(x,y) \cdot w.</math> Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes: :<math> w_{t+1} = w_t + f(x, y) - f(x,\hat y).</math> This multiclass feedback formulation reduces to the original perceptron when <math>x</math> is a real-valued vector, <math>y</math> is chosen from <math>\{0,1\}</math>, and <math>f(x,y) = y x</math>. For certain problems, input/output representations and features can be chosen so that <math>\mathrm{argmax}_y f(x,y) \cdot w</math> can be found efficiently even though <math>y</math> is chosen from a very large or even infinite set. Since 2002, perceptron training has become popular in the field of [[natural language processing]] for such tasks as [[part-of-speech tagging]] and [[syntactic parsing]] (Collins, 2002). It has also been applied to large-scale machine learning problems in a [[distributed computing]] setting.<ref>{{cite book |last1=McDonald |first1=R. |last2=Hall |first2=K. |last3=Mann |first3=G. 
|year=2010 |chapter=Distributed Training Strategies for the Structured Perceptron |title=Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL |pages=456–464 |publisher=Association for Computational Linguistics |chapter-url=https://www.aclweb.org/anthology/N10-1069.pdf }}</ref> ==References== {{Reflist}} ==Further reading== * Aizerman, M. A. and Braverman, E. M. and Lev I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964. * [[Frank Rosenblatt|Rosenblatt, Frank]] (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386–408. {{doi|10.1037/h0042519}}. * [[Frank Rosenblatt|Rosenblatt, Frank]] (1962), Principles of Neurodynamics. Washington, DC: Spartan Books. * [[Marvin Minsky|Minsky, M. L.]] and Papert, S. A. 1969. ''Perceptrons''. Cambridge, MA: MIT Press. * Gallant, S. I. (1990). [https://ieeexplore.ieee.org/document/80230/;jsessionid=EA330364D3E5FFD7513BE8789467267E?arnumber=80230 Perceptron-based learning algorithms.] IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191. * Olazaran Rodriguez, Jose Miguel. ''[https://web.archive.org/web/20221111165150/https://era.ed.ac.uk/bitstream/handle/1842/20075/Olazaran-RodriguezJM_1991redux.pdf?sequence=1&isAllowed=y A historical sociology of neural network research]''. PhD Dissertation. University of Edinburgh, 1991. * Mohri, Mehryar and Rostamizadeh, Afshin (2013). [https://arxiv.org/abs/1305.0208 Perceptron Mistake Bounds] arXiv:1305.0208, 2013. * Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615–622. Polytechnic Institute of Brooklyn. 
* [[Bernard Widrow|Widrow, B.]], Lehr, M.A., "[http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/widrow.pdf 30 years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation]," ''Proc. IEEE'', vol 78, no 9, pp. 1415–1442, (1990). * [[Michael Collins (computational linguist)|Collins, M.]] 2002. [https://www.aclweb.org/anthology/W02-1001 Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm] in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02). * Yin, Hongfeng (1996), Perceptron-Based Algorithms and Analysis, Spectrum Library, Concordia University, Canada == External links == * [http://www.mathworks.com/matlabcentral/fileexchange/32949-a-perceptron-learns-to-perform-a-binary-nand-function/content/PerceptronImpl.m A Perceptron implemented in MATLAB to learn binary NAND function] * Chapter 3 [http://page.mi.fu-berlin.de/rojas/neural/chapter/K3.pdf Weighted networks - the perceptron] and chapter 4 [http://page.mi.fu-berlin.de/rojas/neural/chapter/K4.pdf Perceptron learning] of [http://page.mi.fu-berlin.de/rojas/neural/index.html.html ''Neural Networks - A Systematic Introduction''] by [[Raúl Rojas]] ({{ISBN|978-3-540-60505-8}}) * [http://www.csulb.edu/~cwallis/artificialn/History.htm History of perceptrons] * [http://www.cis.hut.fi/ahonkela/dippa/node41.html Mathematics of multilayer perceptrons] * Applying a perceptron model using [[scikit-learn]] - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html {{Differentiable computing}} {{Authority control}} [[Category:Classification algorithms]] [[Category:Artificial neural networks]]