Editing Pattern recognition (section)

==Problem statement==
The problem of pattern recognition can be stated as follows: Given an unknown function <math>g:\mathcal{X}\rightarrow\mathcal{Y}</math> (the ''ground truth'') that maps input instances <math>\boldsymbol{x} \in \mathcal{X}</math> to output labels <math>y \in \mathcal{Y}</math>, along with training data <math>\mathbf{D} = \{(\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_n, y_n)\}</math> assumed to represent accurate examples of the mapping, produce a function <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that approximates as closely as possible the correct mapping <math>g</math>. (For example, if the problem is filtering spam, then <math>\boldsymbol{x}_i</math> is some representation of an email and <math>y</math> is either "spam" or "non-spam"). In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In [[decision theory]], this is defined by specifying a [[loss function]] or cost function that assigns a specific value to "loss" resulting from producing an incorrect label. The goal then is to minimize the [[expected value|expected]] loss, with the expectation taken over the [[probability distribution]] of <math>\mathcal{X}</math>. In practice, neither the distribution of <math>\mathcal{X}</math> nor the ground truth function <math>g:\mathcal{X}\rightarrow\mathcal{Y}</math> are known exactly, but can be computed only empirically by collecting a large number of samples of <math>\mathcal{X}</math> and hand-labeling them using the correct value of <math>\mathcal{Y}</math> (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of [[classification (machine learning)|classification]], the simple [[zero-one loss function]] is often sufficient. This corresponds simply to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the [[Bayes error rate|error rate]] on independent test data (i.e. counting up the fraction of instances that the learned function <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> labels wrongly, which is equivalent to maximizing the number of correctly classified instances). The goal of the learning procedure is then to minimize the error rate (maximize the [[correctness (computer science)|correctness]]) on a "typical" test set.

For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e., to estimate a function of the form
:<math>p({\rm label}|\boldsymbol{x},\boldsymbol\theta) = f\left(\boldsymbol{x};\boldsymbol{\theta}\right)</math>
where the [[feature vector]] input is <math>\boldsymbol{x}</math>, and the function ''f'' is typically parameterized by some parameters <math>\boldsymbol{\theta}</math>.<ref>For [[linear discriminant analysis]] the parameter vector <math>\boldsymbol\theta</math> consists of the two mean vectors <math>\boldsymbol\mu_1</math> and <math>\boldsymbol\mu_2</math> and the common [[covariance matrix]] <math>\boldsymbol\Sigma</math>.</ref> In a [[discriminative model|discriminative]] approach to the problem, ''f'' is estimated directly. In a [[generative model|generative]] approach, however, the inverse probability <math>p({\boldsymbol{x}|\rm label})</math> is instead estimated and combined with the [[prior probability]] <math>p({\rm label}|\boldsymbol\theta)</math> using [[Bayes' rule]], as follows:
:<math>p({\rm label}|\boldsymbol{x},\boldsymbol\theta) = \frac{p({\boldsymbol{x}|\rm label,\boldsymbol\theta}) p({\rm label|\boldsymbol\theta})}{\sum_{L \in \text{all labels}} p(\boldsymbol{x}|L) p(L|\boldsymbol\theta)}.</math>

When the labels are [[continuous distribution|continuously distributed]] (e.g., in [[regression analysis]]), the denominator involves [[integral|integration]] rather than summation:

:<math>p({\rm label}|\boldsymbol{x},\boldsymbol\theta) = \frac{p({\boldsymbol{x}|\rm label,\boldsymbol\theta}) p({\rm label|\boldsymbol\theta})}{\int_{L \in \text{all labels}} p(\boldsymbol{x}|L) p(L|\boldsymbol\theta) \operatorname{d}L}.</math>

The value of <math>\boldsymbol\theta</math> is typically learned using [[maximum a posteriori]] (MAP) estimation. This finds the best value that simultaneously meets two conflicting objects: To perform as well as possible on the training data (smallest [[Bayes error rate|error-rate]]) and to find the simplest possible model. Essentially, this combines [[maximum likelihood]] estimation with a [[regularization (mathematics)|regularization]] procedure that favors simpler models over more complex models. In a [[Bayesian inference|Bayesian]] context, the regularization procedure can be viewed as placing a [[prior probability]] <math>p(\boldsymbol\theta)</math> on different values of <math>\boldsymbol\theta</math>. Mathematically:

:<math>\boldsymbol\theta^* = \arg \max_{\boldsymbol\theta} p(\boldsymbol\theta|\mathbf{D})</math>

where <math>\boldsymbol\theta^*</math> is the value used for <math>\boldsymbol\theta</math> in the subsequent evaluation procedure, and <math>p(\boldsymbol\theta|\mathbf{D})</math>, the [[posterior probability]] of <math>\boldsymbol\theta</math>, is given by

:<math>p(\boldsymbol\theta|\mathbf{D}) = \left[\prod_{i=1}^n p(y_i|\boldsymbol{x}_i,\boldsymbol\theta) \right] p(\boldsymbol\theta).</math>

In the [[Bayesian statistics|Bayesian]] approach to this problem, instead of choosing a single parameter vector <math>\boldsymbol{\theta}^*</math>, the probability of a given label for a new instance <math>\boldsymbol{x}</math> is computed by integrating over all possible values of <math>\boldsymbol\theta</math>, weighted according to the posterior probability:

:<math>p({\rm label}|\boldsymbol{x}) = \int p({\rm label}|\boldsymbol{x},\boldsymbol\theta)p(\boldsymbol{\theta}|\mathbf{D}) \operatorname{d}\boldsymbol{\theta}.</math>

===Frequentist or Bayesian approach to pattern recognition===
The first pattern classifier – the linear discriminant presented by [[Fisher discriminant analysis|Fisher]] – was developed in the [[Frequentist inference|frequentist]] tradition. The frequentist approach entails that the model parameters are considered unknown, but objective. The parameters are then computed (estimated) from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the [[covariance matrix]]. Also the probability of each class <math>p({\rm label}|\boldsymbol\theta)</math> is estimated from the collected dataset. Note that the usage of '[[Bayes rule]]' in a pattern classifier does not make the classification approach Bayesian.

[[Bayesian inference|Bayesian statistics]] has its origin in Greek philosophy where a distinction was already made between the '[[A priori and a posteriori|a priori]]' and the '[[A priori and a posteriori|a posteriori]]' knowledge. Later [[A priori and a posteriori#Immanuel Kant|Kant]] defined his distinction between what is a priori known – before observation – and the empirical knowledge gained from observations. In a Bayesian pattern classifier, the class probabilities <math>p({\rm label}|\boldsymbol\theta)</math> can be chosen by the user, which are then a priori. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations – using e.g., the [[Beta distribution|Beta-]] ([[Conjugate prior distribution|conjugate prior]]) and [[Dirichlet distribution|Dirichlet-distributions]]. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities, and objective observations.

Probabilistic pattern classifiers can be used according to a frequentist or a Bayesian approach.