== Variants ==
The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. However, these solutions appear purely stochastically, and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps. A schematic implementation is sketched below.

The Maxover algorithm (Wendemuth, 1995) is [[Robustness (computer science)|"robust"]] in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set.<ref>{{cite journal |first=A. |last=Wendemuth |title=Learning the Unlearnable |journal=Journal of Physics A: Mathematical and General |volume=28 |issue=18 |pages=5423–5436 |year=1995 |doi=10.1088/0305-4470/28/18/030 |bibcode=1995JPhA...28.5423W }}</ref> In the linearly separable case, it will solve the training problem, if desired even with optimal stability ([[Hyperplane separation theorem|maximum margin]] between the classes). For non-separable data sets, it will return a solution with a computable small number of misclassifications.<ref>{{cite journal |first=A. |last=Wendemuth |title=Performance of robust training algorithms for neural networks |journal=Journal of Physics A: Mathematical and General |volume=28 |issue=19 |pages=5485–5493 |year=1995 |doi=10.1088/0305-4470/28/19/006 |bibcode=1995JPhA...28.5485W }}</ref> In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets.

The Voted Perceptron (Freund and Schapire, 1999) is a variant using multiple weighted perceptrons. The algorithm starts a new perceptron every time an example is wrongly classified, initializing the weight vector with the final weights of the last perceptron. Each perceptron is also given another weight corresponding to how many examples it correctly classifies before wrongly classifying one; at the end, the output is a weighted vote over all the perceptrons.

In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987)<ref name="KrauthMezard87" /> or the AdaTron (Anlauf and Biehl, 1989).<ref>{{cite journal |first1=J. K. |last1=Anlauf |first2=M. |last2=Biehl |title=The AdaTron: an Adaptive Perceptron algorithm |journal=Europhysics Letters |volume=10 |issue=7 |pages=687–692 |year=1989 |doi=10.1209/0295-5075/10/7/014 |bibcode=1989EL.....10..687A |s2cid=250773895 }}</ref> AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the [[kernel trick]], are the conceptual foundations of the [[support-vector machine]].

The <math>\alpha</math>-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify [[:wiktionary:analogue|analogue]] patterns, by projecting them into a [[Binary Space Partition|binary space]]. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.
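The following Python code is a minimal sketch of the pocket algorithm described above, under assumptions not fixed by the text: the inputs are the rows of a NumPy array <code>X</code>, the labels in <code>y</code> are ±1, no bias term is used, and the misclassified example for each update is chosen at random; all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def pocket_perceptron(X, y, max_steps=1000, seed=0):
    """Pocket algorithm with ratchet (sketch).

    X: (n, d) array of inputs; y: array of labels in {-1, +1}.
    Runs ordinary perceptron updates, but keeps ("pockets") the weight
    vector with the fewest training misclassifications seen so far.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)            # current perceptron weights (no bias term, for brevity)
    pocket_w = w.copy()        # best weights seen so far
    pocket_errors = n + 1      # their number of misclassifications

    for _ in range(max_steps):
        # One ordinary perceptron step on a randomly chosen misclassified example.
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            return w           # separable case: current weights classify everything correctly
        i = rng.choice(mistakes)
        w = w + y[i] * X[i]

        # Ratchet: replace the pocket only if the new weights make strictly fewer errors.
        errors = np.count_nonzero(y * (X @ w) <= 0)
        if errors < pocket_errors:
            pocket_errors = errors
            pocket_w = w.copy()

    return pocket_w            # the best solution seen, not the last one
</syntaxhighlight>

The ratchet is the strict inequality in the last comparison: the pocket is replaced only when the new weights make strictly fewer training errors, so a later poor update cannot discard a better solution that has already been found.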
Another way to solve nonlinear problems without using multiple layers is to use higher-order networks (sigma-pi unit). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an ''n''-order network.

It should be kept in mind, however, that the best classifier is not necessarily the one that classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is [[overfitting|overfitted]].

Other linear classification algorithms include [[Winnow (algorithm)|Winnow]], [[support-vector machine]], and [[logistic regression]].

=== Multiclass perceptron ===
Like most other techniques for training linear classifiers, the perceptron generalizes naturally to [[multiclass classification]]. Here, the input <math>x</math> and the output <math>y</math> are drawn from arbitrary sets. A feature representation function <math>f(x,y)</math> maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector <math>w</math>, but now the resulting score is used to choose among many possible outputs:
:<math>\hat y = \operatorname{argmax}_y f(x,y) \cdot w.</math>
Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:
:<math> w_{t+1} = w_t + f(x, y) - f(x,\hat y).</math>
This multiclass feedback formulation reduces to the original perceptron when <math>x</math> is a real-valued vector, <math>y</math> is chosen from <math>\{0,1\}</math>, and <math>f(x,y) = y x</math>. A sketch of this procedure is given below.

For certain problems, input/output representations and features can be chosen so that <math>\operatorname{argmax}_y f(x,y) \cdot w</math> can be found efficiently even though <math>y</math> is chosen from a very large or even infinite set.

Since 2002, perceptron training has become popular in the field of [[natural language processing]] for such tasks as [[part-of-speech tagging]] and [[syntactic parsing]] (Collins, 2002). It has also been applied to large-scale machine learning problems in a [[distributed computing]] setting.<ref>{{cite book |last1=McDonald |first1=R. |last2=Hall |first2=K. |last3=Mann |first3=G. |year=2010 |chapter=Distributed Training Strategies for the Structured Perceptron |title=Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL |pages=456–464 |publisher=Association for Computational Linguistics |chapter-url=https://www.aclweb.org/anthology/N10-1069.pdf }}</ref>
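The following Python code is a minimal sketch of the multiclass procedure above, instantiating <math>f(x,y)</math> as the common block feature map (one copy of the input per class); this choice of feature map, the fixed number of epochs, and all names are illustrative assumptions rather than part of the general formulation.

<syntaxhighlight lang="python">
import numpy as np

def joint_features(x, y, num_classes):
    """Block feature map f(x, y): a copy of x in the block belonging to
    class y, zeros elsewhere. One concrete choice; the formulation above
    leaves f(x, y) general."""
    f = np.zeros(num_classes * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def multiclass_perceptron(X, y, num_classes, epochs=10):
    """Multiclass perceptron: predict argmax_c f(x, c) . w and, on a
    mistake, update w <- w + f(x, y) - f(x, y_hat).

    X: (n, d) array of inputs; y: integer labels in {0, ..., num_classes - 1}.
    """
    n, d = X.shape
    w = np.zeros(num_classes * d)
    for _ in range(epochs):
        for i in range(n):
            # Prediction: choose the output with the highest score f(x, c) . w.
            scores = [joint_features(X[i], c, num_classes) @ w
                      for c in range(num_classes)]
            y_hat = int(np.argmax(scores))
            # Update only on a mistake, as in the binary perceptron.
            if y_hat != y[i]:
                w += (joint_features(X[i], y[i], num_classes)
                      - joint_features(X[i], y_hat, num_classes))
    return w
</syntaxhighlight>

With this feature map, the update adds <math>x</math> to the weight block of the correct class and subtracts it from the block of the predicted class, which recovers the familiar one-weight-vector-per-class form of the multiclass perceptron.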