== Motivation ==
[[Image:Svm separating hyperplanes (SVG).svg|thumb|right|H<sub>1</sub> does not separate the classes. H<sub>2</sub> does, but only with a small margin. H<sub>3</sub> separates them with the maximal margin.]]

[[Statistical classification|Classifying data]] is a common task in [[machine learning]]. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a ''new'' [[data point]] will be in. In the case of support vector machines, a data point is viewed as a <math>p</math>-dimensional vector (a list of <math>p</math> numbers), and we want to know whether we can separate such points with a <math>(p-1)</math>-dimensional [[hyperplane]]. This is called a [[linear classifier]]. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or [[Margin (machine learning)|margin]], between the two classes. So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the ''[[maximum-margin hyperplane]]'', and the linear classifier it defines is known as a ''maximum-[[margin classifier]]'' or, equivalently, the ''[[Perceptron#Convergence|perceptron of optimal stability]]''.<ref>{{cite journal |last1=Opper |first1=M. |last2=Kinzel |first2=W. |last3=Kleinz |first3=J. |last4=Nehl |first4=R. |title=On the ability of the optimal perceptron to generalise |journal=Journal of Physics A: Mathematical and General |volume=23 |issue=11 |pages=L581 |year=1990 |doi=10.1088/0305-4470/23/11/012 |bibcode=1990JPhA...23L.581O |url=https://dx.doi.org/10.1088/0305-4470/23/11/012 }}</ref>

More formally, a support vector machine constructs a [[hyperplane]] or set of hyperplanes in a high- or infinite-dimensional space, which can be used for [[Statistical classification|classification]], [[Regression analysis|regression]], or other tasks such as outlier detection.<ref>{{cite web |url=http://scikit-learn.org/stable/modules/svm.html |title=1.4. Support Vector Machines – scikit-learn 0.20.2 documentation |access-date=2017-11-08 |url-status=live |archive-url=https://web.archive.org/web/20171108151644/http://scikit-learn.org/stable/modules/svm.html |archive-date=2017-11-08 }}</ref> Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the [[generalization error]] of the classifier.<ref>{{cite book |first1=Trevor |last1=Hastie |author-link=Trevor Hastie |first2=Robert |last2=Tibshirani |author-link2=Robert Tibshirani |first3=Jerome |last3=Friedman |author-link3=Jerome H. Friedman |title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction |location=New York |publisher=Springer |edition=Second |year=2008 |page=134 |url=https://web.stanford.edu/~hastie/Papers/ESLII.pdf#page=153 }}</ref> A lower generalization error means that the classifier is less likely to suffer from [[overfitting]].
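The maximum-margin idea can be made concrete with a minimal sketch (not part of the original formulation): the toy data below are invented for illustration, and the scikit-learn <code>SVC</code> estimator from the library cited above is used with a very large regularization constant to approximate a hard margin.

<syntaxhighlight lang="python">
# Minimal sketch (toy data invented for illustration): a hard-margin linear SVM
# chooses the hyperplane w.x + b = 0 with the largest distance to the nearest
# point of each class.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes of p = 2 dimensional points.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],   # class 0
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin (maximum-margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal w:", w, "offset b:", b)
print("margin width:", 2.0 / np.linalg.norm(w))    # distance between the two classes' margins
print("support vectors:\n", clf.support_vectors_)  # nearest training points on each side
</syntaxhighlight>

The printed margin width follows the standard identity that, for the canonical hard-margin solution, the separation between the two supporting hyperplanes is <math>2/\lVert w \rVert</math>.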
[[File:Kernel_Machine.svg|thumb|Kernel machine]]
Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not [[Linear separability|linearly separable]] in that space. For this reason, it was proposed<ref name="ReferenceA" /> that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that [[dot product]]s of pairs of input data vectors can be computed easily in terms of the variables in the original space, by defining them in terms of a [[Positive-definite kernel|kernel function]] <math>k(x, y)</math> selected to suit the problem.<ref>{{Cite book |last1=Press |first1=William H. |last2=Teukolsky |first2=Saul A. |last3=Vetterling |first3=William T. |last4=Flannery |first4=Brian P. |year=2007 |title=Numerical Recipes: The Art of Scientific Computing |edition=3rd |publisher=Cambridge University Press |location=New York |isbn=978-0-521-88068-8 |chapter=Section 16.5. Support Vector Machines |chapter-url=http://apps.nrbook.com/empanel/index.html#pg=883 |url-status=live |archive-url=https://web.archive.org/web/20110811154417/http://apps.nrbook.com/empanel/index.html#pg=883 |archive-date=2011-08-11 }}</ref> The hyperplanes in the higher-dimensional space are defined as the sets of points whose dot product with a fixed vector in that space is constant; such a vector is orthogonal to the hyperplane it defines. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters <math>\alpha_i</math>, of images of [[feature vector]]s <math>x_i</math> that occur in the database. With this choice of hyperplane, the points <math>x</math> in the [[feature space]] that are mapped into the hyperplane are defined by the relation <math>\textstyle\sum_i \alpha_i k(x_i, x) = \text{constant}.</math> Note that if <math>k(x, y)</math> becomes small as <math>y</math> grows farther away from <math>x</math>, each term in the sum measures the degree of closeness of the test point <math>x</math> to the corresponding database point <math>x_i</math>. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points <math>x</math> mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.
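A minimal sketch of the kernel expansion <math>\textstyle\sum_i \alpha_i k(x_i, x)</math> follows; the Gaussian (RBF) kernel, the toy data, and the parameter values are assumptions chosen purely for illustration. The decision value is reconstructed directly from the fitted model's support vectors and coefficients and compared with the value the library computes itself.

<syntaxhighlight lang="python">
# Minimal sketch of the kernel expansion sum_i alpha_i k(x_i, x) + b, using a
# Gaussian (RBF) kernel, which becomes small as x moves away from x_i.
# The data set and gamma value are invented for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # classes not linearly separable

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def rbf(x, z, gamma=gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2): small when z is far from x."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Reconstruct the decision value from the support vectors x_i and their
# signed coefficients alpha_i, then add the offset b.
x_test = np.array([0.2, -0.1])
alphas = clf.dual_coef_[0]                 # signed alpha_i, one per support vector
value = sum(a * rbf(xi, x_test) for a, xi in zip(alphas, clf.support_vectors_))
value += clf.intercept_[0]

# The manual sum matches the library's decision function (up to rounding).
print(value, clf.decision_function([x_test])[0])
</syntaxhighlight>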