=== Modern methods ===
Recent algorithms for finding the SVM classifier include sub-gradient descent and coordinate descent. Both techniques have proven to offer significant advantages over the traditional approach when dealing with large, sparse datasets: sub-gradient methods are especially efficient when there are many training examples, and coordinate descent is especially efficient when the dimension of the feature space is high.

==== Sub-gradient descent ====
[[Subgradient method|Sub-gradient descent]] algorithms for the SVM work directly with the expression

<math display="block">f(\mathbf{w}, b) = \left[\frac 1 n \sum_{i=1}^n \max\left(0, 1 - y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b)\right) \right] + \lambda \|\mathbf{w}\|^2.</math>

Note that <math>f</math> is a [[convex function]] of <math>\mathbf{w}</math> and <math>b</math>. As such, traditional [[gradient descent]] (or [[Stochastic gradient descent|SGD]]) methods can be adapted: instead of taking a step in the direction of the function's negative gradient, a step is taken in the direction of the negative of a vector selected from the function's [[Subderivative|sub-gradient]]. The substitution is needed because the hinge term is not differentiable at points where <math>y_i(\mathbf{w}^\mathsf{T} \mathbf{x}_i - b) = 1</math>. This approach has the advantage that, for certain implementations, the number of iterations does not scale with <math>n</math>, the number of data points.<ref>{{Cite journal |title=Pegasos: primal estimated sub-gradient solver for SVM |journal=Mathematical Programming |date=2010-10-16 |issn=0025-5610 |pages=3–30 |volume=127 |issue=1 |doi=10.1007/s10107-010-0420-4 |first1=Shai |last1=Shalev-Shwartz |first2=Yoram |last2=Singer |first3=Nathan |last3=Srebro |first4=Andrew |last4=Cotter |citeseerx=10.1.1.161.9629 |s2cid=53306004 }}</ref>
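
As an illustration, the following is a minimal sketch of full-batch sub-gradient descent on <math>f(\mathbf{w}, b)</math> in plain Python. It is not the cited Pegasos solver, which samples training examples stochastically. The function names, the step-size schedule <code>eta0 / sqrt(t)</code> and the default parameter values are assumptions made for this example. Because a sub-gradient step need not decrease <math>f</math>, the sketch returns the best iterate seen.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))


def objective(w, b, X, y, lam):
    """f(w, b): mean hinge loss plus lam * ||w||^2."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y))
    return hinge / len(X) + lam * dot(w, w)


def subgradient_descent(X, y, lam=0.01, steps=500, eta0=0.5):
    """Full-batch sub-gradient descent; returns the best (w, b) seen."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    best_w, best_b, best_f = w[:], b, objective(w, b, X, y, lam)
    for t in range(1, steps + 1):
        # The regularizer contributes 2 * lam * w to the sub-gradient in w.
        gw = [2.0 * lam * wj for wj in w]
        gb = 0.0
        for xi, yi in zip(X, y):
            # Only points with margin below 1 have a non-zero hinge term;
            # at margin exactly 1, zero is a valid sub-gradient choice.
            if yi * (dot(w, xi) - b) < 1.0:
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb += yi / n
        eta = eta0 / t ** 0.5  # assumed diminishing step size
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
        f = objective(w, b, X, y, lam)
        if f < best_f:  # sub-gradient steps need not decrease f
            best_w, best_b, best_f = w[:], b, f
    return best_w, best_b
```

A new point <code>x</code> would then be classified by the sign of <code>dot(w, x) - b</code>, matching the sign convention of the expression above.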

==== Coordinate descent ====
[[Coordinate descent]] algorithms for the SVM work from the dual problem

<math display="block"> \begin{align}
&\text{maximize}\,\, f(c_1, \ldots, c_n) =  \sum_{i=1}^n c_i - \frac 1 2 \sum_{i=1}^n\sum_{j=1}^n y_i c_i(\mathbf{x}_i \cdot \mathbf{x}_j)y_j c_j,\\
&\text{subject to } \sum_{i=1}^n c_iy_i = 0,\,\text{and } 0 \leq c_i \leq \frac{1}{2n\lambda}\;\text{for all }i.
\end{align}</math>

For each <math> i \in \{1,\, \ldots,\, n\}</math>, iteratively, the coefficient <math> c_i</math> is adjusted in the direction of <math> \partial f/ \partial c_i</math>. Then, the resulting vector of coefficients <math> (c_1',\,\ldots,\,c_n')</math> is projected onto the nearest vector of coefficients that satisfies the given constraints. (Typically Euclidean distances are used.) The process is then repeated until a near-optimal vector of coefficients is obtained. The resulting algorithm is extremely fast in practice, although few performance guarantees have been proven.<ref>{{Cite book |publisher=ACM |date=2008-01-01 |location=New York, NY, USA |isbn=978-1-60558-205-4 |pages=408–415 |doi=10.1145/1390156.1390208 |first1=Cho-Jui |last1=Hsieh |first2=Kai-Wei |last2=Chang |first3=Chih-Jen |last3=Lin |first4=S. Sathiya |last4=Keerthi |first5=S. |last5=Sundararajan |title=Proceedings of the 25th international conference on Machine learning - ICML '08 |chapter=A dual coordinate descent method for large-scale linear SVM |citeseerx=10.1.1.149.5594 |s2cid=7880266 }}</ref>
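
As an illustration, the following is a minimal sketch of dual coordinate descent in plain Python under a simplifying assumption: the bias is dropped (<math>b = 0</math>), which removes the equality constraint <math>\textstyle\sum_i c_i y_i = 0</math> from the dual. Under that assumption the Euclidean projection reduces to clipping each coefficient to <math>[0,\, 1/(2n\lambda)]</math>, and each coordinate can be maximized exactly because <math>f</math> is quadratic in <math>c_i</math>. The function names and default parameter values are hypothetical, and the sketch is not a reproduction of the cited implementation.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))


def dual_coordinate_descent(X, y, lam=0.01, sweeps=50):
    """Coordinate ascent on the dual with the bias fixed at b = 0 (assumed)."""
    n, d = len(X), len(X[0])
    upper = 1.0 / (2.0 * n * lam)  # box constraint on each c_i
    c = [0.0] * n
    w = [0.0] * d  # maintained as sum_i c_i * y_i * x_i
    for _ in range(sweeps):
        for i in range(n):
            xi, yi = X[i], y[i]
            q = dot(xi, xi)  # curvature of f along coordinate i
            if q == 0.0:
                continue  # an all-zero example cannot change w
            grad = 1.0 - yi * dot(w, xi)  # partial derivative df/dc_i
            # Exact one-dimensional maximizer, clipped to the box.
            new_ci = min(max(c[i] + grad / q, 0.0), upper)
            delta = new_ci - c[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * yi * xi[j]
                c[i] = new_ci
    return c, w
```

Keeping <code>w</code> up to date alongside <code>c</code> makes each coordinate update cost time proportional to the number of features rather than to the number of training examples, which is why this style of method suits linear SVMs on high-dimensional data.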