== Power of representation ==

===Information theory===

From an [[information theory]] point of view, a single perceptron with ''K'' inputs has a capacity of ''2K'' [[bit]]s of information.<ref name=":2">{{cite book |last=MacKay |first=David |url=https://books.google.com/books?id=AKuMj4PN_EMC&pg=PA483 |title=Information Theory, Inference and Learning Algorithms |date=2003-09-25 |publisher=[[Cambridge University Press]] |isbn=9780521642989 |page=483 |author-link=David J. C. MacKay}}</ref> This result is due to [[Thomas M. Cover|Thomas Cover]].<ref>{{Cite journal |last=Cover |first=Thomas M. |date=June 1965 |title=Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition |url=https://ieeexplore.ieee.org/document/4038449 |journal=IEEE Transactions on Electronic Computers |volume=EC-14 |issue=3 |pages=326–334 |doi=10.1109/PGEC.1965.264137 |issn=0367-7508|url-access=subscription }}</ref>

Specifically, let <math>T(N, K)</math> be the number of ways to linearly separate ''N'' points in general position in ''K'' dimensions. Then<math display="block">T(N, K)=\begin{cases}
2^N & K \geq N \\
2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K<N
\end{cases}</math>When ''K'' is large, <math>T(N, K)/2^N</math> is very close to one for ''N'' below <math>2K</math> and very close to zero for ''N'' above <math>2K</math>, with a sharp transition around <math>N = 2K</math>, where the ratio equals exactly <math>1/2</math>. In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on ''N'' points when <math>N < 2K</math>, but almost certainly not when <math>N > 2K</math>.
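
The transition around <math>N = 2K</math> can be checked numerically. The following is a minimal sketch that evaluates the formula for <math>T(N, K)</math> directly; the function names <code>count_separable</code> and <code>separable_fraction</code> are illustrative and do not come from the cited sources.

```python
from math import comb


def count_separable(n_points: int, dim: int) -> int:
    """Evaluate T(N, K) from the piecewise formula above."""
    if dim >= n_points:
        return 2 ** n_points
    return 2 * sum(comb(n_points - 1, k) for k in range(dim))


def separable_fraction(n_points: int, dim: int) -> float:
    """Fraction T(N, K) / 2^N of label assignments that are separable."""
    return count_separable(n_points, dim) / 2 ** n_points


if __name__ == "__main__":
    dim = 100
    # Scan N from below 2K to above 2K to see the fraction drop.
    for n_points in (150, 190, 200, 210, 250):
        print(n_points, separable_fraction(n_points, dim))
```

With <math>K = 100</math>, the printed fraction falls from nearly one at <math>N = 150</math> to nearly zero at <math>N = 250</math>, passing through exactly one half at <math>N = 200</math>.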

=== Boolean function ===
When operating on only binary inputs, a perceptron is called a [[Linear separability#Linear separability of Boolean functions in n variables|linearly separable Boolean function]], or threshold Boolean function. The number of threshold Boolean functions on ''n'' inputs is given by [[On-Line Encyclopedia of Integer Sequences|OEIS]] sequence [[oeis:A000609|A000609]]. The value is known exactly only up to <math>n=9</math>, but its order of magnitude is known quite precisely: it has upper bound <math>2^{n^2 - n \log_2 n + O(n)}</math> and lower bound <math>2^{n^2 - n \log_2 n - O(n)}</math>.<ref name=":4">{{Cite journal |last1=Šíma |first1=Jiří |last2=Orponen |first2=Pekka |date=2003-12-01 |title=General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results |url=https://direct.mit.edu/neco/article/15/12/2727-2778/6791 |journal=Neural Computation |language=en |volume=15 |issue=12 |pages=2727–2778 |doi=10.1162/089976603322518731 |pmid=14629867 |issn=0899-7667|url-access=subscription }}</ref>

Any Boolean linear threshold function can be implemented with only integer weights. Furthermore, the number of bits necessary and sufficient for representing a single integer weight parameter is <math>\Theta(n \ln n)</math>.<ref name=":4" />
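
For very small ''n'', the threshold Boolean functions can be enumerated by brute force over integer weights. The sketch below is illustrative only: it assumes the unit outputs 1 exactly when the weighted sum plus an integer bias is positive, and it assumes that weights of magnitude at most 2 are enough, which holds for <math>n \leq 3</math> but not for larger ''n''. The function name and parameters are hypothetical.

```python
from itertools import product


def threshold_functions(n: int, max_weight: int = 2) -> set:
    """Collect truth tables of all functions x -> [w.x + b > 0] with
    integer weights in [-max_weight, max_weight].

    Assumption: max_weight=2 reaches every threshold function only
    for n <= 3; larger n needs larger weights.
    """
    inputs = list(product((0, 1), repeat=n))
    bias_bound = n * max_weight  # |w.x| can never exceed this value
    tables = set()
    for weights in product(range(-max_weight, max_weight + 1), repeat=n):
        for bias in range(-bias_bound, bias_bound + 2):
            table = tuple(
                int(sum(w * x for w, x in zip(weights, xs)) + bias > 0)
                for xs in inputs
            )
            tables.add(table)
    return tables


if __name__ == "__main__":
    for n in range(4):
        print(n, len(threshold_functions(n)))
```

The counts for <math>n = 0, 1, 2, 3</math> are 2, 4, 14 and 104. For <math>n = 2</math>, the two missing truth tables out of 16 are exclusive-or and its negation.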

===Universal approximation theorem===

{{main|Universal approximation theorem}}

A single perceptron can learn to classify any half-space. It cannot classify any set of vectors that is not linearly separable, such as the Boolean [[exclusive-or]] problem (the famous "XOR problem").
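
The XOR case can be verified with a short calculation. Assume, for illustration, a unit that outputs 1 exactly when <math>w_1 x_1 + w_2 x_2 + b > 0</math>; the symbols <math>w_1, w_2, b</math> are notation introduced here. Computing XOR would require all four of the following conditions:<math display="block">\begin{aligned}
(0,0) \mapsto 0 &: \quad b \leq 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \leq 0
\end{aligned}</math>Adding the two middle conditions gives <math>w_1 + w_2 + 2b > 0</math>, while adding the first and last gives <math>w_1 + w_2 + 2b \leq 0</math>. The two conclusions contradict each other, so no choice of weights and bias works.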

A perceptron network with '''one hidden layer''' can learn to classify any compact subset arbitrarily closely. Similarly, it can also approximate any [[Compactly supported|compactly-supported]] [[continuous function]] arbitrarily closely. This is essentially a special case of the [[Universal approximation theorem#Arbitrary-width case|theorems by George Cybenko and Kurt Hornik]].
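
As a small concrete instance of what a hidden layer adds, the sketch below wires three threshold units by hand so that the network computes XOR, which a single unit cannot. The weights are chosen for illustration rather than learned, and the helper names are hypothetical.

```python
def step(z: float) -> int:
    """Heaviside step: 1 if z is positive, else 0."""
    return 1 if z > 0 else 0


def perceptron(weights, bias, inputs) -> int:
    """A single threshold unit: step(w.x + b)."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1: int, x2: int) -> int:
    """Two hidden units (OR and NAND) feeding one output unit (AND)."""
    h_or = perceptron((1, 1), 0, (x1, x2))        # fires unless both inputs are 0
    h_nand = perceptron((-1, -1), 2, (x1, x2))    # fires unless both inputs are 1
    return perceptron((1, 1), -1, (h_or, h_nand))  # fires only if both hidden units fire


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_network(x1, x2))
```

Each hidden unit cuts the input plane with one line; the output unit keeps the strip between the two lines, which contains exactly the inputs <math>(0,1)</math> and <math>(1,0)</math>.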

=== Conjunctively local perceptron ===
{{Main|Perceptrons (book)}}
''Perceptrons'' (Minsky and Papert, 1969) studied the kind of perceptron networks necessary to learn various Boolean functions.

Consider a perceptron network with <math>n</math> input units, one hidden layer, and one output, similar to the Mark I Perceptron machine. It computes a Boolean function of type <math>f: 2^n \to 2 </math>. Minsky and Papert call a function '''conjunctively local of order <math>k</math>''' if and only if there exists such a perceptron network computing it in which each unit in the hidden layer connects to at most <math>k</math> input units.

Theorem. (Theorem 3.1.1): The parity function is conjunctively local of order <math>n</math>.
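
To illustrate why full connectivity is enough for parity, the sketch below builds one hidden unit per non-empty subset of inputs, each computing the conjunction of its inputs, and combines them with output weights <math>(-2)^{|S|-1}</math>. This construction follows from the identity <math>\tfrac{1}{2}\bigl(1 - \prod_i (1 - 2x_i)\bigr) = \sum_{S \neq \emptyset} (-2)^{|S|-1} \prod_{i \in S} x_i</math> and is offered only as an illustration, not as the book's own proof. The largest hidden unit reads all <math>n</math> inputs; the code does not show that this is unavoidable.

```python
from itertools import combinations, product


def parity_via_masks(x) -> int:
    """Parity as a threshold over conjunctions (masks) of input subsets.

    Each hidden unit is the AND of the inputs in one non-empty subset S;
    its output weight is (-2) ** (|S| - 1). The weighted sum equals the
    parity of x, so thresholding at zero recovers it.
    """
    n = len(x)
    total = 0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            mask = all(x[i] == 1 for i in subset)  # hidden unit for this subset
            total += (-2) ** (size - 1) * int(mask)
    return 1 if total > 0 else 0


if __name__ == "__main__":
    for x in product((0, 1), repeat=3):
        print(x, parity_via_masks(x))
```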

Theorem. (Section 5.5): The connectedness function is conjunctively local of order <math>\Omega(n^{1/2})</math>.