Statistical classification
{{Short description|Categorization of data using statistics}}
In [[statistics]], '''classification''' is the task of identifying which of a set of categories an observation belongs to. When [[classification]] is performed by a computer, statistical methods are normally used to develop the algorithm. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as [[explanatory variables]] or ''features''. These properties may variously be [[categorical data|categorical]] (e.g. "A", "B", "AB" or "O", for [[blood type]]), [[ordinal data|ordinal]] (e.g. "large", "medium" or "small"), [[integer|integer-valued]] (e.g. the number of occurrences of a particular word in an [[email]]) or [[real number|real-valued]] (e.g. a measurement of [[blood pressure]]). Other classifiers work by comparing observations to previous observations by means of a [[similarity function|similarity]] or [[metric (mathematics)|distance]] function.

An [[algorithm]] that implements classification, especially in a concrete implementation, is known as a '''classifier'''. The term "classifier" sometimes also refers to the mathematical [[function (mathematics)|function]], implemented by a classification algorithm, that maps input data to a category.

Terminology across fields is quite varied. In [[statistics]], where classification is often done with [[logistic regression]] or a similar procedure, the properties of observations are termed [[explanatory variable]]s (or [[independent variable]]s, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the [[dependent variable]]. In [[machine learning]], the observations are often known as ''instances'', the explanatory variables are termed ''features'' (grouped into a [[feature vector]]), and the possible categories to be predicted are ''classes''. Other fields may use different terminology: e.g. in [[community ecology]], the term "classification" normally refers to [[cluster analysis]].
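As a toy illustration of a distance-based classifier of the kind mentioned above, the following sketch (with made-up numbers) represents each class by the centroid of its training observations and assigns a new observation to the class with the nearest centroid:

```python
import math

# Illustrative sketch only: a nearest-centroid classifier.
# Each class is summarized by the mean (centroid) of its training
# feature vectors; a new observation is assigned to the class whose
# centroid is closest under Euclidean distance.

def centroids(training_data):
    """Map each class label to the mean of its feature vectors."""
    result = {}
    for label, vectors in training_data.items():
        n = len(vectors)
        result[label] = [sum(v[i] for v in vectors) / n
                         for i in range(len(vectors[0]))]
    return result

def classify(x, class_centroids):
    """Return the label of the centroid nearest to x."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(class_centroids,
               key=lambda label: distance(x, class_centroids[label]))

# Hypothetical training data: two blood-pressure-like measurements per instance.
training = {
    "low":  [[110.0, 70.0], [115.0, 72.0]],
    "high": [[150.0, 95.0], [160.0, 100.0]],
}
cents = centroids(training)
print(classify([118.0, 74.0], cents))  # nearest centroid is "low"
```

Here `classify` plays the role of the mathematical function that maps input data to a category.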
==Relation to other problems==

[[Classification]] and clustering are examples of the more general problem of [[pattern recognition]], which is the assignment of some sort of output value to a given input value. Other examples are [[regression analysis|regression]], which assigns a real-valued output to each input; [[sequence labeling]], which assigns a class to each member of a sequence of values (for example, [[part of speech tagging]], which assigns a [[part of speech]] to each word in an input sentence); [[parsing]], which assigns a [[parse tree]] to an input sentence, describing the [[syntactic structure]] of the sentence; etc.

A common subclass of classification is [[probabilistic classification]]. Algorithms of this nature use [[statistical inference]] to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a [[probability]] of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. Such an algorithm has numerous advantages over non-probabilistic classifiers:
*It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a ''confidence-weighted classifier'').
*Correspondingly, it can ''abstain'' when its confidence of choosing any particular output is too low.
*Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of ''error propagation''.

==Frequentist procedures==

Early work on statistical classification was undertaken by [[Ronald Fisher|Fisher]],<ref>{{Cite journal |doi = 10.1111/j.1469-1809.1936.tb02137.x|title = The Use of Multiple Measurements in Taxonomic Problems|year = 1936|last1 = Fisher|first1 = R. A.|journal = [[Annals of Eugenics]]|volume = 7|issue = 2|pages = 179–188|hdl = 2440/15227|hdl-access = free}}</ref><ref>{{Cite journal |doi = 10.1111/j.1469-1809.1938.tb02189.x|title = The Statistical Utilization of Multiple Measurements|year = 1938|last1 = Fisher|first1 = R. A.|journal = [[Annals of Eugenics]]|volume = 8|issue = 4|pages = 376–386|hdl = 2440/15232|hdl-access = free}}</ref> in the context of two-group problems, leading to [[Fisher's linear discriminant]] function as the rule for assigning a group to a new observation.<ref name=G1977>Gnanadesikan, R. (1977) ''Methods for Statistical Data Analysis of Multivariate Observations'', Wiley. {{ISBN|0-471-30845-5}} (pp. 83–86)</ref> This early work assumed that data-values within each of the two groups had a [[multivariate normal distribution]]. The extension of this same context to more than two groups has also been considered with a restriction imposed that the classification rule should be [[linear]].<ref name=G1977/><ref>[[C. R. Rao|Rao, C.R.]] (1952) ''Advanced Statistical Methods in Multivariate Analysis'', Wiley. (Section 9c)</ref> Later work for the multivariate normal distribution allowed the classifier to be [[nonlinear]]:<ref>[[T. W. Anderson|Anderson, T.W.]] (1958) ''An Introduction to Multivariate Statistical Analysis'', Wiley.</ref> several classification rules can be derived based on different adjustments of the [[Mahalanobis distance]], with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.

==Bayesian procedures==

Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the different groups within the overall population.<ref>{{Cite journal |doi = 10.1093/biomet/65.1.31|title = Bayesian cluster analysis|year = 1978|last1 = Binder|first1 = D. A.|journal = [[Biometrika]]|volume = 65|pages = 31–38}}</ref> Bayesian procedures tend to be computationally expensive and, in the days before [[Markov chain Monte Carlo]] computations were developed, approximations for Bayesian clustering rules were devised.<ref>{{Cite journal | doi=10.1093/biomet/68.1.275| title=Approximations to Bayesian clustering rules| year=1981| last1=Binder| first1=David A.| journal=[[Biometrika]]| volume=68| pages=275–285}}</ref> Some Bayesian procedures involve the calculation of [[group-membership probabilities]]: these provide a more informative outcome than a simple attribution of a single group-label to each new observation.

==Binary and multiclass classification==

Classification can be thought of as two separate problems – [[binary classification]] and [[multiclass classification]]. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.<ref>[[Sariel Har-Peled|Har-Peled, S.]], Roth, D., Zimak, D. (2003) "Constraint Classification for Multiclass Classification and Ranking." In: Becker, B., [[Sebastian Thrun|Thrun, S.]], Obermayer, K. (Eds) ''Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference'', MIT Press. {{ISBN|0-262-02550-7}}</ref> Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.

== Feature vectors ==
{{main|Feature vector}}
Most algorithms describe an individual instance whose category is to be predicted using a [[feature vector]] of individual, measurable properties of the instance. Each property is termed a [[feature (pattern recognition)|feature]], also known in statistics as an [[explanatory variable]] (or [[independent variable]], although features may or may not be [[statistically independent]]).
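As an illustration of such a feature vector (the property names and encodings here are hypothetical), an instance's mixed-type properties can be packed into a single numeric vector:

```python
# Illustrative sketch: encoding mixed-type properties as one numeric
# feature vector. A categorical blood type is one-hot encoded, an
# ordinal size is mapped to an integer rank, and integer-valued and
# real-valued properties are used directly. Names are made up.

BLOOD_TYPES = ["A", "B", "AB", "O"]                  # categorical
SIZE_RANKS = {"small": 0, "medium": 1, "large": 2}   # ordinal

def to_feature_vector(blood_type, size, word_count, blood_pressure):
    one_hot = [1.0 if blood_type == t else 0.0 for t in BLOOD_TYPES]
    return one_hot + [float(SIZE_RANKS[size]),
                      float(word_count),
                      float(blood_pressure)]

print(to_feature_vector("AB", "large", 3, 120.5))
# [0.0, 0.0, 1.0, 0.0, 2.0, 3.0, 120.5]
```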
Features may variously be [[binary data|binary]] (e.g. "on" or "off"); [[categorical data|categorical]] (e.g. "A", "B", "AB" or "O", for [[blood type]]); [[ordinal data|ordinal]] (e.g. "large", "medium" or "small"); [[integer|integer-valued]] (e.g. the number of occurrences of a particular word in an email); or [[real number|real-valued]] (e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be ''discretized'' into groups (e.g. less than 5, between 5 and 10, or greater than 10).

==Linear classifiers==
{{main|Linear classifier}}
A large number of [[algorithm]]s for classification can be phrased in terms of a [[linear function]] that assigns a score to each possible category ''k'' by [[linear combination|combining]] the feature vector of an instance with a vector of weights, using a [[dot product]]. The predicted category is the one with the highest score. This type of score function is known as a [[linear predictor function]] and has the following general form:
<math display=block>\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol\beta_k \cdot \mathbf{X}_i,</math>
where '''X'''<sub>''i''</sub> is the feature vector for instance ''i'', '''β'''<sub>''k''</sub> is the vector of weights corresponding to category ''k'', and score('''X'''<sub>''i''</sub>, ''k'') is the score associated with assigning instance ''i'' to category ''k''. In [[discrete choice]] theory, where instances represent people and categories represent choices, the score is considered the [[utility]] associated with person ''i'' choosing category ''k''.

Algorithms with this basic setup are known as [[linear classifier]]s.
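The linear predictor function above can be sketched directly: compute the dot product of the feature vector with each category's weight vector and take the arg max. The weight values below are invented for illustration, not trained.

```python
# Sketch of a linear classifier: score(X_i, k) = β_k · X_i, with the
# predicted category being the one with the highest score. Weights are
# hypothetical; a real classifier would learn them from training data.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def predict(x, weights):
    """weights maps each category k to its weight vector β_k."""
    scores = {k: dot(beta_k, x) for k, beta_k in weights.items()}
    return max(scores, key=scores.get), scores

weights = {
    "spam":     [0.9, -0.2, 0.4],
    "not_spam": [-0.5, 0.7, 0.1],
}
x = [1.0, 0.5, 2.0]            # feature vector X_i
label, scores = predict(x, weights)
print(label)                   # "spam": score 1.6 vs 0.05
```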
What distinguishes them is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. Examples of such algorithms include:
* {{annotated link|Logistic regression}}
** {{annotated link|Multinomial logistic regression}}
* {{annotated link|Probit regression}}
* The [[perceptron]] algorithm
* {{annotated link|Support vector machine}}
* {{annotated link|Linear discriminant analysis}}

==Algorithms==

Since no single form of classification is appropriate for all data sets, a large toolkit of classification algorithms has been developed. The most commonly used include:<ref>{{Cite news|url=https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies|title=A Tour of The Top 10 Algorithms for Machine Learning Newbies|date=2018-01-20|work=Built In|access-date=2019-06-10}}</ref>
* {{annotated link|Artificial neural networks}}
* {{annotated link|Boosting (machine learning)}}
* {{annotated link|Random forest}}
* {{annotated link|Genetic programming}}
** {{annotated link|Gene expression programming}}
** {{annotated link|Multi expression programming}}
** {{annotated link|Linear genetic programming}}
* {{annotated link|Variable kernel density estimation#Use for statistical classification|Kernel estimation}}
** {{annotated link|k-nearest neighbor algorithm|k-nearest neighbor}}
* {{annotated link|Learning vector quantization}}
* {{annotated link|Linear classifier}}
** {{annotated link|Fisher's linear discriminant}}
** {{annotated link|Logistic regression}}
** {{annotated link|Naive Bayes classifier}}
** {{annotated link|Perceptron}}
* {{annotated link|Quadratic classifier}}
* {{annotated link|Support vector machine}}
** {{annotated link|Least squares support vector machine}}

Choices between different possible algorithms are frequently made on the basis of quantitative [[Classification#Evaluation of accuracy|evaluation of accuracy]].
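Such quantitative evaluation can be as simple as measuring the fraction of correct predictions on held-out data; a minimal sketch with made-up labels:

```python
# Illustrative sketch of accuracy evaluation: compare predicted labels
# against true labels on held-out test data (labels here are made up).

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["A", "B", "A", "A", "B"]
y_pred = ["A", "B", "B", "A", "B"]
print(accuracy(y_true, y_pred))  # 0.8 (4 of 5 correct)
```

In practice, richer summaries such as a confusion matrix are often examined alongside overall accuracy.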
==Application domains==
{{see also|Cluster analysis#Applications}}
Classification has many applications. In some of these, it is employed as a [[data mining]] procedure, while in others more detailed statistical modeling is undertaken.
* {{annotated link|Biological classification}}
* {{annotated link|Biometric}} identification
* {{annotated link|Computer vision}}
** Medical image analysis and {{annotated link|medical imaging}}
** {{annotated link|Optical character recognition}}
** {{annotated link|Video tracking}}
* {{annotated link|Credit scoring}}
* {{annotated link|Document classification}}
* [[Drug discovery]] and {{annotated link|Drug development|development}}
** {{annotated link|Toxicogenomics}}
** {{annotated link|Quantitative structure-activity relationship}}
* {{annotated link|Geostatistics}}
* {{annotated link|Handwriting recognition}}
* Internet {{annotated link|search engines}}
* [[Micro-array classification]]
* {{annotated link|Pattern recognition}}
* {{annotated link|Recommender system}}
* {{annotated link|Speech recognition}}
* {{annotated link|Statistical natural language processing}}

{{More footnotes needed|date=January 2010}}

==See also==
{{Portal|Mathematics}}
{{colbegin}}
* {{annotated link|Artificial intelligence}}
* {{annotated link|Binary classification}}
* {{annotated link|Multiclass classification}}
* {{annotated link|Class membership probabilities}}
* {{annotated link|Classification rule}}
* {{annotated link|Compound term processing}}
* {{annotated link|Confusion matrix}}
* {{annotated link|Data mining}}
* {{annotated link|Data warehouse}}
* {{annotated link|Fuzzy logic}}
* {{annotated link|Information retrieval}}
* {{annotated link|List of datasets for machine learning research}}
* {{annotated link|Machine learning}}
* {{annotated link|Recommender system}}
{{colend}}

==References==
{{Commons category}}
{{Reflist}}

{{Statistics|analysis||state=expanded}}
{{Authority control}}

{{DEFAULTSORT:Statistical Classification}}
[[Category:Statistical classification| ]] [[Category:Classification algorithms|*]]