== Empirical risk minimization ==

The soft-margin support vector machine described above is an example of an [[empirical risk minimization]] (ERM) algorithm for the ''[[hinge loss]]''. Seen this way, support vector machines belong to a natural class of algorithms for statistical inference, and many of their unique features are due to the behavior of the hinge loss. This perspective can provide further insight into how and why SVMs work, and allow us to better analyze their statistical properties.

=== Risk minimization ===

In supervised learning, one is given a set of training examples <math>X_1 \ldots X_n</math> with labels <math>y_1 \ldots y_n</math>, and wishes to predict <math>y_{n+1}</math> given <math>X_{n+1}</math>. To do so one forms a [[hypothesis]], <math>f</math>, such that <math>f(X_{n+1})</math> is a "good" approximation of <math>y_{n+1}</math>. A "good" approximation is usually defined with the help of a ''[[loss function]],'' <math>\ell(y,z)</math>, which characterizes how bad <math>z</math> is as a prediction of <math>y</math>. We would then like to choose a hypothesis that minimizes the ''[[Loss function#Expected loss|expected risk]]:''
<math display="block">\varepsilon(f) = \mathbb{E}\left[\ell(y_{n+1}, f(X_{n+1})) \right].</math>

In most cases, we don't know the joint distribution of <math>X_{n+1},\,y_{n+1}</math> outright. In these cases, a common strategy is to choose the hypothesis that minimizes the ''empirical risk:''
<math display="block">\hat \varepsilon(f) = \frac 1 n \sum_{k=1}^n \ell(y_k, f(X_k)).</math>

Under certain assumptions about the sequence of random variables <math>X_k,\, y_k</math> (for example, that they are generated by a finite Markov process), if the set of hypotheses being considered is small enough, the minimizer of the empirical risk will closely approximate the minimizer of the expected risk as <math>n</math> grows large. This approach is called ''empirical risk minimization,'' or ERM.

=== Regularization and stability ===

In order for the minimization problem to have a well-defined solution, we have to place constraints on the set <math>\mathcal{H}</math> of hypotheses being considered. If <math>\mathcal{H}</math> is a [[Normed vector space|normed space]] (as is the case for SVM), a particularly effective technique is to consider only those hypotheses <math>f</math> for which <math>\lVert f \rVert_{\mathcal H} < k</math>. This is equivalent to imposing a ''regularization penalty'' <math>\mathcal R(f) = \lambda_k\lVert f \rVert_{\mathcal H}</math>, and solving the new optimization problem
<math display="block">\hat f = \mathrm{arg}\min_{f \in \mathcal{H}} \hat \varepsilon(f) + \mathcal{R}(f).</math>

This approach is called ''[[Tikhonov regularization]].'' More generally, <math>\mathcal{R}(f)</math> can be some measure of the complexity of the hypothesis <math>f</math>, so that simpler hypotheses are preferred.
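To make the regularized-ERM recipe concrete, the following sketch computes the empirical risk of a linear hypothesis and adds a squared-norm penalty in place of <math>\mathcal{R}(f)</math>. It is only an illustration of the definitions above: the helper names (<code>empirical_risk</code>, <code>regularized_objective</code>, <code>lam</code>), the use of NumPy, and the toy data are choices made here, not part of the formulation itself.

<syntaxhighlight lang="python">
import numpy as np

def empirical_risk(loss, f, X, y):
    """Average loss of hypothesis f over the training sample: (1/n) * sum_k loss(y_k, f(X_k))."""
    return np.mean([loss(y_k, f(x_k)) for x_k, y_k in zip(X, y)])

def regularized_objective(loss, w, b, X, y, lam):
    """Regularized empirical risk for the linear hypothesis f(x) = w.x - b,
    with the squared-norm penalty lam * ||w||^2 standing in for R(f)."""
    f = lambda x: np.dot(w, x) - b
    return empirical_risk(loss, f, X, y) + lam * np.dot(w, w)

# Hinge loss, as used by the soft-margin SVM below.
hinge = lambda y, z: max(0.0, 1.0 - y * z)

# Tiny worked example: two points in R^2, both classified with margin >= 1,
# so only the penalty term contributes (0.1 * ||w||^2 = 0.05).
X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1])
w, b, lam = np.array([0.5, 0.5]), 0.0, 0.1
print(regularized_objective(hinge, w, b, X, y, lam))
</syntaxhighlight>

With the hinge loss plugged in, <code>regularized_objective</code> is exactly the soft-margin SVM objective discussed in the next subsection.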
=== SVM and the hinge loss ===

Recall that the (soft-margin) SVM classifier <math>\hat\mathbf{w}, b: \mathbf{x} \mapsto \sgn(\hat\mathbf{w}^\mathsf{T} \mathbf{x} - b)</math> is chosen to minimize the following expression:
<math display="block">\left[\frac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right] + \lambda \|\mathbf{w}\|^2.</math>

In light of the above discussion, we see that the SVM technique is equivalent to empirical risk minimization with Tikhonov regularization, where in this case the loss function is the [[hinge loss]]
<math display="block">\ell(y,z) = \max\left(0, 1 - yz \right).</math>

From this perspective, SVM is closely related to other fundamental [[Statistical classification#Linear classifiers|classification algorithm]]s such as [[regularized least-squares]] and [[logistic regression]]. The difference between the three lies in the choice of loss function: regularized least-squares amounts to empirical risk minimization with the [[Loss functions for classification|square-loss]], <math>\ell_{sq}(y,z) = (y-z)^2</math>; logistic regression employs the [[Loss functions for classification|log-loss]],
<math display="block">\ell_{\log}(y,z) = \ln(1 + e^{-yz}).</math>

==== Target functions ====

The difference between the hinge loss and these other loss functions is best stated in terms of ''target functions'' – the function that minimizes expected risk for a given pair of random variables <math>X,\,y</math>.

In particular, let <math>y_x</math> denote <math>y</math> conditional on the event that <math>X = x</math>. In the classification setting, we have:
<math display="block">y_x = \begin{cases} 1 & \text{with probability } p_x \\ -1 & \text{with probability } 1-p_x \end{cases}</math>

The optimal classifier is therefore:
<math display="block">f^*(x) = \begin{cases}1 & \text{if }p_x \geq 1/2 \\ -1 & \text{otherwise}\end{cases}</math>

For the square-loss, the target function is the conditional expectation function, <math>f_{sq}(x) = \mathbb{E}\left[y_x\right]</math>; for the logistic loss, it's the logit function, <math>f_{\log}(x) = \ln\left(p_x / ({1-p_x})\right)</math>. While both of these target functions yield the correct classifier, as <math>\sgn(f_{sq}) = \sgn(f_\log) = f^*</math>, they give us more information than we need. In fact, they give us enough information to completely describe the distribution of <math>y_x</math>.

On the other hand, one can check that the target function for the hinge loss is ''exactly'' <math>f^*</math>. Thus, in a sufficiently rich hypothesis space—or equivalently, for an appropriately chosen kernel—the SVM classifier will converge to the simplest function (in terms of <math>\mathcal{R}</math>) that correctly classifies the data. This extends the geometric interpretation of SVM—for linear classification, the empirical risk is minimized by any function whose margins lie between the support vectors, and the simplest of these is the max-margin classifier.<ref>{{Cite journal |title=Are Loss Functions All the Same? |url=https://ieeexplore.ieee.org/document/6789841 |journal=Neural Computation |date=2004-05-01 |issn=0899-7667 |pages=1063–1076 |volume=16 |issue=5 |doi=10.1162/089976604773135104 |pmid=15070510 |first1=Lorenzo |last1=Rosasco |first2=Ernesto |last2=De Vito |first3=Andrea |last3=Caponnetto |first4=Michele |last4=Piana |first5=Alessandro |last5=Verri |hdl=11380/4590 |citeseerx=10.1.1.109.6786 |s2cid=11845688 }}</ref>
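The target-function comparison can be checked numerically. The sketch below, which assumes SciPy and a single illustrative conditional probability <math>p_x = 0.8</math>, minimizes the expected loss of a constant prediction <math>z</math> under each of the three losses. The minimizers come out as roughly <math>1</math> for the hinge loss (recovering <math>f^*</math>), <math>2p_x - 1 = 0.6</math> for the square-loss (the conditional mean), and <math>\ln(p_x/(1-p_x)) \approx 1.386</math> for the log-loss (the logit). The function and variable names are chosen here for illustration only.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

def expected_loss(loss, z, p):
    """Expected loss of predicting the constant z when y = +1 with probability p, else -1."""
    return p * loss(1, z) + (1 - p) * loss(-1, z)

hinge   = lambda y, z: max(0.0, 1.0 - y * z)     # hinge loss
square  = lambda y, z: (y - z) ** 2              # square-loss
logloss = lambda y, z: np.log1p(np.exp(-y * z))  # log-loss

p = 0.8  # illustrative value of p_x
for name, loss in [("hinge", hinge), ("square", square), ("log", logloss)]:
    res = minimize_scalar(lambda z: expected_loss(loss, z, p),
                          bounds=(-5.0, 5.0), method="bounded")
    print(f"{name:>6}: minimizer ~ {res.x:.3f}")

# Expected output (up to numerical tolerance):
#  hinge: minimizer ~ 1.000   (= f*(x))
# square: minimizer ~ 0.600   (= 2p - 1 = E[y_x])
#    log: minimizer ~ 1.386   (= ln(p / (1 - p)))
</syntaxhighlight>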