== Extensions ==

=== Multiclass SVM ===
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominant approach for doing so is to reduce the single [[multiclass problem]] into multiple [[binary classification]] problems.<ref name="duan2005">{{Cite book |last1=Duan |first1=Kai-Bo |last2=Keerthi |first2=S. Sathiya |chapter=Which Is the Best Multiclass SVM Method? An Empirical Study |doi=10.1007/11494683_28 |title=Multiple Classifier Systems |series=[[Lecture Notes in Computer Science|LNCS]] |volume=3541 |pages=278–285 |year=2005 |isbn=978-3-540-26306-7 |citeseerx=10.1.1.110.6789 |chapter-url=https://www.cs.iastate.edu/~honavar/multiclass-svm2.pdf |access-date=2019-07-18 |archive-date=2013-05-03 |archive-url=https://web.archive.org/web/20130503183745/http://www.cs.iastate.edu/~honavar/multiclass-svm2.pdf |url-status=dead }}</ref> Common methods for such reduction include:<ref name="duan2005" /><ref name="hsu2002">{{cite journal |title=A Comparison of Methods for Multiclass Support Vector Machines |year=2002 |journal=IEEE Transactions on Neural Networks |last1=Hsu |first1=Chih-Wei |last2=Lin |first2=Chih-Jen |volume=13 |issue=2 |pages=415–25 |name-list-style=amp |url=http://www.cs.iastate.edu/~honavar/multiclass-svm.pdf |pmid=18244442 |doi=10.1109/72.991427 |access-date=2018-01-08 |archive-date=2013-05-03 |archive-url=https://web.archive.org/web/20130503183743/http://www.cs.iastate.edu/~honavar/multiclass-svm.pdf |url-status=dead }}</ref>
* Building binary classifiers that distinguish between one of the labels and the rest (''one-versus-all'') or between every pair of classes (''one-versus-one''). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification (both reductions are sketched below).
* [[Directed acyclic graph]] SVM (DAGSVM)<ref>{{cite book |chapter=Large margin DAGs for multiclass classification |editor1=Solla, Sara A. |editor1-link=Sara Solla |editor2=Leen, Todd K. |editor3=Müller, Klaus-Robert |editor3-link=Klaus-Robert Müller |title=Advances in Neural Information Processing Systems |publisher=MIT Press |year=2000 |chapter-url=http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/DAGSVM.pdf |pages=547–553 |last1=Platt |first1=John |author-link2=Nello Cristianini |last2=Cristianini |first2=Nello |author-link3=John Shawe-Taylor |last3=Shawe-Taylor |first3=John |url-status=live |archive-url=https://web.archive.org/web/20120616221540/http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/DAGSVM.pdf |archive-date=2012-06-16 }}</ref>
* [[Error correcting code|Error-correcting output codes]]<ref>{{cite journal |title=Solving Multiclass Learning Problems via Error-Correcting Output Codes |journal=Journal of Artificial Intelligence Research |year=1995 |url=http://www.jair.org/media/105/live-105-1426-jair.pdf |pages=263–286 |last1=Dietterich |first1=Thomas G. |last2=Bakiri |first2=Ghulum |bibcode=1995cs........1101D |arxiv=cs/9501101 |volume=2 |url-status=live |archive-url=https://web.archive.org/web/20130509061344/http://www.jair.org/media/105/live-105-1426-jair.pdf |archive-date=2013-05-09 |doi=10.1613/jair.105 |s2cid=47109072 }}</ref>

Crammer and Singer proposed a multiclass SVM method which casts the [[multiclass classification]] problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.<ref>{{cite journal |title=On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines |year=2001 |url=http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf |journal=Journal of Machine Learning Research |volume=2 |pages=265–292 |last1=Crammer |first1=Koby |last2=Singer |first2=Yoram |name-list-style=amp |url-status=live |archive-url=https://web.archive.org/web/20150829102651/http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf |archive-date=2015-08-29 }}</ref> See also Lee, Lin and Wahba<ref>{{cite journal |title=Multicategory Support Vector Machines |year=2001 |journal=Computing Science and Statistics |volume=33 |url=http://www.interfacesymposia.org/I01/I2001Proceedings/YLee/YLee.pdf |last1=Lee |first1=Yoonkyung |last2=Lin |first2=Yi |last3=Wahba |first3=Grace |name-list-style=amp |url-status=usurped |archive-url=https://web.archive.org/web/20130617093314/http://www.interfacesymposia.org/I01/I2001Proceedings/YLee/YLee.pdf |archive-date=2013-06-17 }}</ref><ref>{{Cite journal |doi=10.1198/016214504000000098 |title=Multicategory Support Vector Machines |journal=Journal of the American Statistical Association |volume=99 |issue=465 |pages=67–81 |year=2004 |last1=Lee |first1=Yoonkyung |last2=Lin |first2=Yi |last3=Wahba |first3=Grace |citeseerx=10.1.1.22.1879 |s2cid=7066611 }}</ref> and Van den Burg and Groenen.<ref>{{Cite journal |title=GenSVM: A Generalized Multiclass Support Vector Machine |year=2016 |url=http://jmlr.org/papers/volume17/14-526/14-526.pdf |journal=Journal of Machine Learning Research |volume=17 |issue=224 |pages=1–42 |last1=Van den Burg |first1=Gerrit J. J. |last2=Groenen |first2=Patrick J. F. |name-list-style=amp}}</ref>
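For illustration, the two reductions described above can be sketched as follows using the [[scikit-learn]] library; the data set and hyperparameters are arbitrary example choices and are not part of any of the cited methods.

<syntaxhighlight lang="python">
# Illustrative sketch of one-versus-rest and one-versus-one multiclass SVMs.
# The data set (Iris) and hyperparameters are arbitrary example choices.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)  # three-class toy problem

# One-versus-rest: one binary SVM per class; a new instance is assigned to the
# class whose classifier has the largest (comparable) decision-function value.
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

# One-versus-one: one binary SVM per pair of classes; prediction by max-wins voting.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
</syntaxhighlight>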
=== Transductive support vector machines ===
Transductive support vector machines extend SVMs in that they can also treat partially labeled data in [[semi-supervised learning]] by following the principles of [[Transduction (machine learning)|transduction]]. Here, in addition to the training set <math>\mathcal{D}</math>, the learner is also given a set
<math display="block">\mathcal{D}^\star = \{ \mathbf{x}^\star_i \mid \mathbf{x}^\star_i \in \mathbb{R}^p\}_{i=1}^k </math>
of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:<ref>{{Cite conference |last=Joachims |first=Thorsten |title=Transductive Inference for Text Classification using Support Vector Machines |url=http://www1.cs.columbia.edu/~dplewis/candidacy/joachims99transductive.pdf |conference=Proceedings of the 1999 International Conference on Machine Learning (ICML 1999) |pages=200–209}}</ref>

Minimize (in <math>\mathbf{w}, b, \mathbf{y}^\star</math>)
<math display="block">\frac{1}{2}\|\mathbf{w}\|^2</math>
subject to (for any <math>i = 1, \dots, n</math> and any <math>j = 1, \dots, k</math>)
<math display="block">\begin{align} &y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \\ &y^\star_j(\mathbf{w} \cdot \mathbf{x}^\star_j - b) \ge 1, \end{align}</math>
and
<math display="block">y^\star_j \in \{-1, 1\}.</math>

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.
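Solving this problem exactly is difficult because the test labels <math>y^\star_j</math> are discrete variables, so approximate procedures are used in practice. The following is a rough, simplified sketch of a self-labeling heuristic in the spirit of transductive SVMs (it is not Joachims' algorithm); scikit-learn and arbitrary parameters are assumed for illustration.

<syntaxhighlight lang="python">
# Simplified self-labeling heuristic in the spirit of transductive SVMs.
# This is an illustrative approximation, not Joachims' exact algorithm.
import numpy as np
from sklearn.svm import SVC

def transductive_svm_sketch(X_lab, y_lab, X_test, n_iter=10, C=1.0):
    # Start from the inductive SVM trained on the labeled data only.
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    y_star = clf.predict(X_test)  # initial guess for the test labels y*
    for _ in range(n_iter):
        # Retrain on the labeled data plus the currently guessed test labels.
        X_all = np.vstack([X_lab, X_test])
        y_all = np.concatenate([y_lab, y_star])
        clf = SVC(kernel="linear", C=C).fit(X_all, y_all)
        y_new = clf.predict(X_test)
        if np.array_equal(y_new, y_star):  # stop once the labels stabilize
            break
        y_star = y_new
    return clf, y_star
</syntaxhighlight>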
=== Structured SVM ===
{{Main|structured SVM}}
The structured support-vector machine is an extension of the traditional SVM model. While the SVM model is primarily designed for binary classification, multiclass classification, and regression tasks, structured SVM broadens its application to handle general structured output labels, for example parse trees, classification with taxonomies, and sequence alignment.<ref>{{Cite web |url=https://www.cs.cornell.edu/people/tj/publications/tsochantaridis_etal_04a.pdf |title=Support Vector Machine Learning for Interdependent and Structured Output Spaces |website=www.cs.cornell.edu}}</ref>

=== Regression ===
[[Image:Svr epsilons demo.svg|thumbnail|right|Support vector regression (prediction) with different thresholds ''ε''. As ''ε'' increases, the prediction becomes less sensitive to errors.]]
A version of SVM for [[regression analysis|regression]] was proposed in 1996 by [[Vladimir N. Vapnik]], Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola.<ref>Drucker, Harris; Burges, Christ. C.; Kaufman, Linda; Smola, Alexander J.; and Vapnik, Vladimir N. (1997); "[http://papers.nips.cc/paper/1238-support-vector-regression-machines.pdf Support Vector Regression Machines]", in ''Advances in Neural Information Processing Systems 9, NIPS 1996'', 155–161, MIT Press.</ref> This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. Another SVM version known as [[least-squares support vector machine]] (LS-SVM) was proposed by Suykens and Vandewalle.<ref>Suykens, Johan A. K.; Vandewalle, Joos P. L.; "[https://lirias.kuleuven.be/bitstream/123456789/218716/2/Suykens_NeurProcLett.pdf Least squares support vector machine classifiers]", ''Neural Processing Letters'', vol. 9, no. 3, Jun. 1999, pp. 293–300.</ref>

Training the original SVR means solving<ref>{{cite journal |last1=Smola |first1=Alex J. |first2=Bernhard |last2=Schölkopf |title=A tutorial on support vector regression |journal=Statistics and Computing |volume=14 |issue=3 |year=2004 |pages=199–222 |url=http://eprints.pascal-network.org/archive/00000856/01/fulltext.pdf |url-status=live |archive-url=https://web.archive.org/web/20120131193522/http://eprints.pascal-network.org/archive/00000856/01/fulltext.pdf |archive-date=2012-01-31 |doi=10.1023/B:STCO.0000035301.49549.88 |citeseerx=10.1.1.41.1452 |s2cid=15475 }}</ref>
: minimize <math>\tfrac{1}{2} \|w\|^2 </math>
: subject to <math> | y_i - \langle w, x_i \rangle - b | \le \varepsilon </math>
where <math>x_i</math> is a training sample with target value <math>y_i</math>. The inner product plus intercept <math>\langle w, x_i \rangle + b</math> is the prediction for that sample, and <math>\varepsilon</math> is a free parameter that serves as a threshold: all predictions have to lie within an <math>\varepsilon</math> range of the true target values. Slack variables are usually added to allow for errors and to permit an approximate solution when the above problem is infeasible.
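The ε-insensitive behaviour described above can be illustrated with scikit-learn's SVR implementation; the synthetic data and hyperparameters below are arbitrary example choices.

<syntaxhighlight lang="python">
# Illustrative epsilon-insensitive support vector regression on synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# epsilon sets the width of the insensitive tube around the prediction;
# training points that fall inside the tube do not affect the fitted model.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
y_pred = model.predict(X)
</syntaxhighlight>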
=== Bayesian SVM ===
In 2011, Polson and Scott showed that the SVM admits a [[Bayesian probability|Bayesian]] interpretation through the technique of [[data augmentation]].<ref>{{cite journal |last1=Polson |first1=Nicholas G. |last2=Scott |first2=Steven L. |title=Data Augmentation for Support Vector Machines |journal=Bayesian Analysis |year=2011 |volume=6 |issue=1 |pages=1–23 |doi=10.1214/11-BA601 |doi-access=free }}</ref> In this approach the SVM is viewed as a [[graphical model]] (where the parameters are connected via probability distributions). This extended view allows the application of [[Bayesian probability|Bayesian]] techniques to SVMs, such as flexible feature modeling, automatic [[Hyperparameter (machine learning)|hyperparameter]] tuning, and [[Posterior predictive distribution|predictive uncertainty quantification]]. A scalable version of the Bayesian SVM was developed by [https://arxiv.org/search/stat?searchtype=author&query=Wenzel%2C+F Florian Wenzel], enabling the application of Bayesian SVMs to [[big data]].<ref>{{cite book |last1=Wenzel |first1=Florian |last2=Galy-Fajou |first2=Theo |last3=Deutsch |first3=Matthäus |last4=Kloft |first4=Marius |title=Machine Learning and Knowledge Discovery in Databases |chapter=Bayesian Nonlinear Support Vector Machines for Big Data |volume=10534 |pages=307–322 |year=2017 |arxiv=1707.05532 |bibcode=2017arXiv170705532W |doi=10.1007/978-3-319-71249-9_19 |series=Lecture Notes in Computer Science |isbn=978-3-319-71248-2 |s2cid=4018290 }}</ref> Wenzel and co-authors developed two versions: a variational inference (VI) scheme for the Bayesian kernel SVM and a stochastic variational inference (SVI) scheme for the linear Bayesian SVM.<ref>Florian Wenzel; Matthäus Deutsch; Théo Galy-Fajou; Marius Kloft; "[http://approximateinference.org/accepted/WenzelEtAl2016.pdf Scalable Approximate Inference for the Bayesian Nonlinear Support Vector Machine]"</ref>