== Linear SVM ==
[[File:SVM margin.png|thumb|Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.|alt=|300x300px]]

We are given a training dataset of <math>n</math> points of the form
<math display="block"> (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n),</math>
where the <math>y_i</math> are either 1 or −1, each indicating the class to which the point <math>\mathbf{x}_i</math> belongs. Each <math>\mathbf{x}_i</math> is a <math>p</math>-dimensional [[Real number|real]] vector. We want to find the "maximum-margin hyperplane" that divides the group of points <math>\mathbf{x}_i</math> for which <math>y_i = 1</math> from the group of points for which <math>y_i = -1</math>, which is defined so that the distance between the hyperplane and the nearest point <math>\mathbf{x}_i</math> from either group is maximized.

Any [[hyperplane]] can be written as the set of points <math>\mathbf{x}</math> satisfying
<math display="block">\mathbf{w}^\mathsf{T} \mathbf{x} - b = 0,</math>
where <math>\mathbf{w}</math> is the (not necessarily normalized) [[Normal (geometry)|normal vector]] to the hyperplane. This is much like [[Hesse normal form]], except that <math>\mathbf{w}</math> is not necessarily a unit vector. The parameter <math>\tfrac{b}{\|\mathbf{w}\|}</math> determines the offset of the hyperplane from the origin along the normal vector <math>\mathbf{w}</math>. Warning: most of the literature on the subject defines the bias so that
<math display="block">\mathbf{w}^\mathsf{T} \mathbf{x} + b = 0.</math>

=== Hard-margin ===
If the training data is [[linearly separable]], we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalized or standardized dataset, these hyperplanes can be described by the equations

: <math>\mathbf{w}^\mathsf{T} \mathbf{x} - b = 1</math> (anything on or above this boundary is of one class, with label 1)

and

: <math>\mathbf{w}^\mathsf{T} \mathbf{x} - b = -1</math> (anything on or below this boundary is of the other class, with label −1).

Geometrically, the distance between these two hyperplanes is <math>\tfrac{2}{\|\mathbf{w}\|}</math>,<ref>{{cite web |url=https://math.stackexchange.com/q/1305925/168764 |title=Why is the SVM margin equal to <math>\frac{2}{\|\mathbf{w}\|}</math> |date=30 May 2015 |website=Mathematics Stack Exchange}}</ref> so to maximize the distance between the planes we want to minimize <math>\|\mathbf{w}\|</math>. The distance is computed using the [[distance from a point to a plane]] equation. To prevent data points from falling into the margin, we add the following constraint: for each <math>i</math>, either
<math display="block">\mathbf{w}^\mathsf{T} \mathbf{x}_i - b \ge 1 \, , \text{ if } y_i = 1,</math>
or
<math display="block">\mathbf{w}^\mathsf{T} \mathbf{x}_i - b \le -1 \, , \text{ if } y_i = -1.</math>
These constraints state that each data point must lie on the correct side of the margin.
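To make the geometry above concrete, the following short sketch evaluates the decision function <math>\mathbf{w}^\mathsf{T} \mathbf{x} - b</math>, the margin width <math>\tfrac{2}{\|\mathbf{w}\|}</math>, and the hard-margin constraints in NumPy. The hyperplane parameters and sample points are hypothetical values chosen only for illustration; they do not come from the text above.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only).
w = np.array([2.0, 1.0])   # normal vector w of the hyperplane w^T x - b = 0
b = 1.0                    # offset term

# A few hypothetical labeled sample points, labels in {+1, -1}.
X = np.array([[ 2.0,  1.0],
              [ 1.5,  0.5],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
y = np.array([1, 1, -1, -1])

decision = X @ w - b                        # signed values w^T x_i - b
margin_width = 2.0 / np.linalg.norm(w)      # distance 2 / ||w|| between the two margin hyperplanes
satisfies_constraint = y * decision >= 1    # hard-margin constraint y_i (w^T x_i - b) >= 1

print("decision values:      ", decision)
print("margin width:         ", margin_width)
print("constraints satisfied:", satisfies_constraint)
</syntaxhighlight>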
These constraints can be rewritten as
{{NumBlk||<math display="block">y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) \ge 1, \quad \text{ for all } 1 \le i \le n.</math>|{{EquationRef|1}}}}

We can put this together to get the optimization problem:
<math display="block">\begin{align} &\underset{\mathbf{w},\;b}{\operatorname{minimize}} && \frac{1}{2}\|\mathbf{w}\|^2\\ &\text{subject to} && y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) \geq 1 \quad \forall i \in \{1,\dots,n\}. \end{align}</math>

The <math>\mathbf{w}</math> and <math>b</math> that solve this problem determine the final classifier, <math>\mathbf{x} \mapsto \sgn(\mathbf{w}^\mathsf{T} \mathbf{x} - b)</math>, where <math>\sgn(\cdot)</math> is the [[sign function]].

An important consequence of this geometric description is that the maximum-margin hyperplane is completely determined by those <math>\mathbf{x}_i</math> that lie nearest to it (explained below). These <math>\mathbf{x}_i</math> are called ''support vectors''.{{anchor|Support vectors}}

=== Soft-margin ===
To extend SVM to cases in which the data are not linearly separable, the ''[[hinge loss]]'' function is helpful:
<math display="block">\max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right).</math>
Note that <math>y_i</math> is the ''i''-th target (i.e., in this case, 1 or −1), and <math>\mathbf{w}^\mathsf{T} \mathbf{x}_i - b</math> is the ''i''-th output. This function is zero if the constraint in {{EquationNote|1|(1)}} is satisfied, in other words, if <math>\mathbf{x}_i</math> lies on the correct side of the margin. For data on the wrong side of the margin, the function's value is proportional to the distance from the margin.

The goal of the optimization then is to minimize
<math display="block"> \lVert \mathbf{w} \rVert^2 + C \left[\frac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right],</math>
where the parameter <math>C > 0</math> determines the trade-off between increasing the margin size and ensuring that the <math>\mathbf{x}_i</math> lie on the correct side of the margin (note that a weight can be placed on either term of this objective). By introducing slack variables <math>\zeta_i = \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right)</math> for the hinge loss, and absorbing the <math>\tfrac 1 n</math> factor into <math>C</math>, this optimization problem can be rewritten as
<math display="block">\begin{align} &\underset{\mathbf{w},\;b,\;\mathbf{\zeta}}{\operatorname{minimize}} &&\|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \zeta_i\\ &\text{subject to} && y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0 \quad \forall i\in \{1,\dots,n\}. \end{align}</math>
Thus, for large values of <math>C</math>, the soft-margin SVM behaves similarly to the hard-margin SVM when the input data are linearly separable, but it still yields a classifier when they are not.
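To make the soft-margin objective concrete, here is a minimal sketch that minimizes <math>\lVert \mathbf{w} \rVert^2 + C \left[\tfrac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right]</math> directly by full-batch subgradient descent in NumPy. It is intended only as an illustration of the formula, not as the method used by standard SVM solvers, which typically work on the dual problem (e.g. via [[sequential minimal optimization]]); the learning rate, iteration count, and toy dataset are arbitrary illustrative choices.

<syntaxhighlight lang="python">
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Minimize ||w||^2 + (C/n) * sum_i max(0, 1 - y_i (w^T x_i - b))
    by full-batch subgradient descent.

    X: (n, p) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    Returns (w, b) for the classifier x -> sign(w^T x - b).
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)      # y_i (w^T x_i - b)
        violated = margins < 1         # points inside the margin or misclassified
        # Subgradient of the regularizer plus the averaged hinge losses.
        grad_w = 2 * w - (C / n) * (y[violated][:, None] * X[violated]).sum(axis=0)
        grad_b = (C / n) * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy two-class data in 2-D (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([ 2.0,  2.0], 0.5, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w, b = train_soft_margin_svm(X, y, C=10.0)
print("training accuracy:", (np.sign(X @ w - b) == y).mean())
</syntaxhighlight>

Raising <math>C</math> in this sketch penalizes margin violations more heavily and pushes the solution toward the hard-margin behaviour described above, while small <math>C</math> favours a wider margin at the cost of more violations.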