
==Overview==
{{Further|topic=Combination Of Shifted FIlter REsponses|COSFIRE}}

A modern definition of pattern recognition is:
{{blockquote
|The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.<ref name="Bishop2006">
{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer}}</ref>}}

Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value. ''[[Supervised learning]]'' assumes that a set of training data (the [[training set]]) has been provided, consisting of a set of instances that have been properly labeled by hand with the correct output. A learning procedure then generates a model that attempts to meet two sometimes conflicting objectives: to perform as well as possible on the training data, and to generalize as well as possible to new data (usually, this means being as simple as possible, for some technical definition of "simple", in accordance with [[Occam's Razor]], discussed below). ''[[Unsupervised learning]]'', on the other hand, assumes training data that has not been hand-labeled, and attempts to find inherent patterns in the data that can then be used to determine the correct output value for new data instances.<ref>{{Cite journal| author= Carvalko, J.R., Preston K. | year=1972 |title= On Determining Optimum Simple Golay Marking Transforms for Binary Image Processing | journal= IEEE Transactions on Computers  | volume=21 | issue=12 | pages=1430–33  | doi = 10.1109/T-C.1972.223519| s2cid=21050445 }}.</ref> In some cases of unsupervised learning, there may be no training data at all. A combination of the two approaches is ''[[semi-supervised learning]]'', which uses both labeled and unlabeled data (typically a small set of labeled data combined with a large amount of unlabeled data).

Sometimes different terms are used to describe the corresponding supervised and unsupervised learning procedures for the same type of output. The unsupervised equivalent of classification is normally known as ''[[data clustering|clustering]]'', based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent [[similarity measure]] (e.g. the [[distance]] between instances, considered as vectors in a multi-dimensional [[vector space]]), rather than assigning each input instance to one of a set of pre-defined classes. In some fields the terminology differs: in [[community ecology]], for example, the term ''classification'' is used to refer to what is commonly known as "clustering".
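
As an illustration of grouping instances by distance, the following minimal sketch uses [[k-means clustering|''k''-means]] (Lloyd's algorithm) with Euclidean distance. The choice of ''k''-means, the toy points and the function names are assumptions made for this example; the text above does not prescribe any particular clustering method.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Hypothetical, unlabeled toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

No class labels are supplied: the cluster indices in <code>labels</code> emerge only from the distances between the instances.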

The piece of input data for which an output value is generated is formally termed an ''instance''. The instance is formally described by a [[feature vector|vector]] of features, which together constitute a description of all known characteristics of the instance. These feature vectors can be seen as defining points in an appropriate [[space (mathematics)|multidimensional space]], and methods for manipulating vectors in [[vector space]]s can be correspondingly applied to them, such as computing the [[dot product]] or the angle between two vectors. Features are typically [[categorical data|categorical]] (also known as [[nominal data|nominal]], i.e., consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), [[ordinal data|ordinal]] (consisting of one of a set of ordered items, e.g., "large", "medium" or "small"), [[integer|integer-valued]] (e.g., a count of the number of occurrences of a particular word in an email) or [[real number|real-valued]] (e.g., a measurement of blood pressure). Often, categorical and ordinal data are grouped together, as are integer-valued and real-valued data. Many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be ''discretized'' into groups (e.g., less than 5, between 5 and 10, or greater than 10).
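
The vector operations and the discretization mentioned above can be sketched as follows. The function names are illustrative, and assigning the boundary values 5 and 10 to the middle group is an assumption, since the example bins in the text do not say where the boundaries fall.

```python
import math

def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle(u, v):
    """Angle in radians between two non-zero feature vectors."""
    cos_theta = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def discretize(value):
    """Map a numeric feature to one of three categorical groups."""
    if value < 5:
        return "less than 5"
    if value <= 10:  # assumption: boundary values belong to the middle group
        return "between 5 and 10"
    return "greater than 10"

right_angle = angle((1.0, 0.0), (0.0, 1.0))
groups = [discretize(x) for x in (3, 7, 12)]
```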

===Probabilistic classifiers===
{{Main|Probabilistic classifier}}
Many common pattern recognition algorithms are ''probabilistic'' in nature, in that they use [[statistical inference]] to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output a [[probability]] of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the ''N''-best labels with associated probabilities, for some value of ''N'', instead of simply a single best label. When the number of possible labels is fairly small (e.g., in the case of [[classification (machine learning)|classification]]), ''N'' may be set so that the probabilities of all possible labels are output. Probabilistic algorithms have many advantages over non-probabilistic algorithms:
*They output a confidence value associated with their choice. (Note that some other algorithms may also output confidence values, but in general, only for probabilistic algorithms is this value mathematically grounded in [[probability theory]]. Non-probabilistic confidence values cannot in general be given any specific meaning, and can only be used to compare against other confidence values output by the same algorithm.)
*Correspondingly, they can ''abstain'' when the confidence of choosing any particular output is too low.
*Because of the probabilities output, probabilistic pattern-recognition algorithms can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of ''error propagation''.
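
The ''N''-best output and the option to abstain can be sketched as follows. The example assumes that some probabilistic classifier has already produced a probability for each label; the labels, the probabilities and the threshold of 0.6 are hypothetical values chosen for illustration.

```python
def n_best(posterior, n):
    """Return the n most probable (label, probability) pairs, best first."""
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)[:n]

def classify_or_abstain(posterior, threshold=0.6):
    """Return the most probable label, or None if its probability is too low."""
    label, probability = n_best(posterior, 1)[0]
    return label if probability >= threshold else None

# Hypothetical classifier outputs for two different instances.
confident = {"spam": 0.7, "ham": 0.2, "newsletter": 0.1}
uncertain = {"spam": 0.4, "ham": 0.35, "newsletter": 0.25}

top_two = n_best(confident, 2)
decision_confident = classify_or_abstain(confident)
decision_uncertain = classify_or_abstain(uncertain)  # abstains (None)
```

Because the full list of probabilities is available, a downstream component can weigh the alternatives itself rather than relying on a single, possibly wrong, label.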

===Number of important feature variables===
[[Feature selection]] algorithms attempt to directly prune out redundant or irrelevant features. A general introduction to [[feature selection]] that summarizes approaches and challenges has been given.<ref>Isabelle Guyon Clopinet, André Elisseeff (2003). ''An Introduction to Variable and Feature Selection''. The Journal of Machine Learning Research, Vol. 3, 1157-1182. [http://www-vis.lbl.gov/~romano/mlgroup/papers/guyon03a.pdf Link] {{Webarchive|url=https://web.archive.org/web/20160304035940/http://www-vis.lbl.gov/~romano/mlgroup/papers/guyon03a.pdf |date=2016-03-04 }}</ref> Because of its non-monotonic character, feature selection is an [[optimization problem]] in which, given a total of <math>n</math> features, all <math>2^n-1</math> non-empty subsets of features (the [[powerset]] without the empty set) need to be explored. The [[Branch and bound|Branch-and-Bound algorithm]]<ref>
{{Cite journal|author1=Iman Foroutan |author2=Jack Sklansky | year=1987 |
title=Feature Selection for Automatic Classification of Non-Gaussian Data | journal=IEEE Transactions on Systems, Man, and Cybernetics | volume=17 | pages=187&ndash;198 | doi = 10.1109/TSMC.1987.4309029 | issue=2
|s2cid=9871395 }}.</ref> does reduce this complexity but remains intractable for medium to large values of the number of available features <math>n</math>.
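
The size of this search space can be made concrete with a brute-force sketch that enumerates every non-empty subset of features. The scoring function <code>toy_score</code> and its weights are hypothetical stand-ins for a real evaluation criterion such as validation accuracy, and this is exhaustive search, not the Branch-and-Bound algorithm.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of the features (2**n - 1 in total)."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def best_subset(features, score):
    """Exhaustive search: evaluate each subset and keep the best-scoring one."""
    return max(all_feature_subsets(features), key=score)

# Hypothetical criterion: features "a" and "c" are informative,
# and every selected feature carries a small cost.
USEFULNESS = {"a": 2.0, "c": 1.5}

def toy_score(subset):
    return sum(USEFULNESS.get(f, 0.0) for f in subset) - 0.5 * len(subset)

features = ["a", "b", "c", "d"]
subset_count = sum(1 for _ in all_feature_subsets(features))
selected = best_subset(features, toy_score)
```

With four features there are only 15 subsets to score, but the count doubles with each additional feature, which is why exhaustive search quickly becomes impractical.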

Techniques to transform the raw feature vectors ('''feature extraction''') are sometimes used prior to application of the pattern-matching algorithm. [[Feature extraction]] algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as [[principal components analysis]] (PCA). The distinction between '''feature selection''' and '''feature extraction''' is that the resulting features after feature extraction has taken place are of a different sort than the original features and may not easily be interpretable, while the features left after feature selection are simply a subset of the original features.
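
The following minimal sketch illustrates feature extraction by computing the first principal component of a small data set and projecting each instance onto it. Using [[power iteration]] on the sample covariance matrix is a choice made here to keep the example dependency-free; the data and function names are hypothetical, and practical implementations typically rely on a numerical library instead.

```python
import math

def first_principal_component(data, iterations=100):
    """Estimate the leading eigenvector of the sample covariance matrix.

    Assumes the data has non-zero variance and that the all-ones starting
    vector is not orthogonal to the leading eigenvector.
    """
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(data, means, component):
    """Reduce each instance to a single coordinate along the component."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(component)))
            for row in data]

# Hypothetical two-feature data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(data)
reduced = project(data, means, component)
```

Each value in <code>reduced</code> is a weighted combination of both original features rather than one of them, which is the sense in which extracted features differ from a selected subset.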