== Probabilistic model ==

Abstractly, naive Bayes is a [[conditional probability]] model: it assigns probabilities <math>p(C_k \mid x_1, \ldots, x_n)</math> for each of the {{mvar|K}} possible outcomes or ''classes'' <math>C_k</math> given a problem instance to be classified, represented by a vector <math>\mathbf{x} = (x_1, \ldots, x_n)</math> encoding some {{mvar|n}} features (independent variables).<ref>{{cite book | last1 = Narasimha Murty | first1 = M. | last2 = Susheela Devi | first2 = V. | title = Pattern Recognition: An Algorithmic Approach | year = 2011 | publisher = Springer | isbn = 978-0857294944 }}</ref>

The problem with the above formulation is that if the number of features {{mvar|n}} is large or if a feature can take on a large number of values, then basing such a model on [[Conditional probability table|probability tables]] is infeasible. The model must therefore be reformulated to make it more tractable. Using [[Bayes' theorem]], the conditional probability can be decomposed as
<math display="block">p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} \,.</math>

In plain English, using [[Bayesian probability]] terminology, the above equation can be written as
<math display="block">\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \,.</math>

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on <math>C</math> and the values of the features <math>x_i</math> are given, so the denominator is effectively constant. The numerator is equivalent to the [[joint probability]] model
<math display="block">p(C_k, x_1, \ldots, x_n)\,,</math>
which can be rewritten as follows, using the [[Chain rule (probability)|chain rule]] for repeated applications of the definition of [[conditional probability]]:
<math display="block">\begin{align}
p(C_k, x_1, \ldots, x_n) & = p(x_1, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \ p(x_3, \ldots, x_n, C_k) \\
& = \cdots \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k) \ p(x_n \mid C_k) \ p(C_k) \,.
\end{align}</math>

Now the "naive" [[conditional independence]] assumptions come into play: assume that all features in <math>\mathbf{x}</math> are [[mutually independent]], conditional on the category <math>C_k</math>. Under this assumption,
<math display="block">p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)\,.</math>

Thus, the joint model can be expressed as
<math display="block">\begin{align}
p(C_k \mid x_1, \ldots, x_n) \varpropto\ & p(C_k, x_1, \ldots, x_n) \\
& = p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \cdots \\
& = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,,
\end{align}</math>
where <math>\varpropto</math> denotes [[Proportionality (mathematics)|proportionality]], since the denominator <math>p(\mathbf{x})</math> is omitted.

This means that under the above independence assumptions, the conditional distribution over the class variable <math>C</math> is
<math display="block">p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z} \ p(C_k) \prod_{i=1}^n p(x_i \mid C_k),</math>
where the evidence <math>Z = p(\mathbf{x}) = \sum_k p(C_k) \ p(\mathbf{x} \mid C_k)</math> is a scaling factor dependent only on <math>x_1, \ldots, x_n</math>, that is, a constant if the values of the feature variables are known.
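As an illustration of the formula above, the following is a minimal Python sketch that computes the posteriors <math>p(C_k \mid \mathbf{x})</math> from given class priors and per-feature conditional probabilities under the naive independence assumption. The names <code>priors</code>, <code>likelihoods</code> and <code>posterior</code> are purely illustrative and do not refer to any particular library.

<syntaxhighlight lang="python">
# Illustrative sketch: naive Bayes posteriors p(C_k | x) from class priors
# and per-feature conditional probabilities (all names are illustrative).

def posterior(priors, likelihoods, x):
    """priors[k] = p(C_k); likelihoods[k][i] is a function giving p(x_i | C_k)."""
    # Unnormalized scores: p(C_k) * prod_i p(x_i | C_k)
    scores = {}
    for k in priors:
        score = priors[k]
        for i, x_i in enumerate(x):
            score *= likelihoods[k][i](x_i)
        scores[k] = score
    # Evidence Z = sum_k p(C_k) p(x | C_k) normalizes the scores.
    Z = sum(scores.values())
    return {k: s / Z for k, s in scores.items()}

# Example: two classes and two binary features with given conditional probabilities.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [lambda v: 0.8 if v else 0.2, lambda v: 0.6 if v else 0.4],
    "ham":  [lambda v: 0.1 if v else 0.9, lambda v: 0.3 if v else 0.7],
}
print(posterior(priors, likelihoods, (1, 0)))  # posteriors sum to 1
</syntaxhighlight>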
Often, it is only necessary to [[Discriminative model|discriminate]] between classes. In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to an additive constant:
<math display="block">\ln p(C_k \mid x_1, \ldots, x_n) = \ln p(C_k) + \sum_{i=1}^n \ln p(x_i \mid C_k) \underbrace{- \ln Z}_{\text{irrelevant}}</math>
The scaling factor drops out, since discrimination subtracts it away:
<math display="block">\ln \frac{p(C_k \mid x_1, \ldots, x_n)}{p(C_l \mid x_1, \ldots, x_n)} = \left(\ln p(C_k) + \sum_{i=1}^n \ln p(x_i \mid C_k) \right) - \left(\ln p(C_l) + \sum_{i=1}^n \ln p(x_i \mid C_l) \right)</math>
There are two benefits to working with log-probabilities. One is that they admit an interpretation in information theory, where log-probabilities are quantities of information measured in [[Nat (unit)|nats]]. The other is that they avoid [[arithmetic underflow]].

=== Constructing a classifier from the probability model ===

The discussion so far has derived the independent feature model, that is, the naive Bayes [[probability model]]. The naive Bayes [[Statistical classification|classifier]] combines this model with a [[decision rule]]. One common rule is to pick the hypothesis that is most probable, so as to minimize the probability of misclassification; this is known as the ''[[maximum a posteriori|maximum a posteriori]]'' or ''MAP'' decision rule. The corresponding classifier, a [[Bayes classifier]], is the function that assigns a class label <math>\hat{y} = C_k</math> for some {{mvar|k}} as follows:
<math display="block">\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \ p(C_k) \prod_{i=1}^n p(x_i \mid C_k).</math>

[[File:ROC_curves.svg|thumb|[[Likelihood function]]s <math>p(\mathbf{x} \mid Y)</math>, [[confusion matrix]] and [[ROC curve]]. For the naive Bayes classifier, given that the a priori probabilities <math>p(Y)</math> are the same for all classes, the [[decision boundary]] (green line) would be placed at the point where the two probability densities intersect, due to {{nowrap|<math>p(Y \mid \mathbf{x}) = \frac{p(Y) \ p(\mathbf{x} \mid Y)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid Y)</math>.}}]]
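The following sketch combines the two points above: it implements the MAP decision rule in log space, so the evidence term <math>\ln Z</math> is simply dropped and underflow is avoided. It is a self-contained illustration under the same assumptions and illustrative names as the earlier sketch, not the implementation used by any particular library.

<syntaxhighlight lang="python">
import math

# Illustrative sketch of the MAP decision rule in log space (names are
# illustrative).  priors[k] = p(C_k); likelihoods[k][i] is a function
# returning p(x_i | C_k).

def classify_map(priors, likelihoods, x):
    """Return argmax_k [ ln p(C_k) + sum_i ln p(x_i | C_k) ]; ln Z is dropped."""
    best_class, best_log_score = None, float("-inf")
    for k in priors:
        # Summing log-probabilities avoids arithmetic underflow for large n.
        log_score = math.log(priors[k]) + sum(
            math.log(likelihoods[k][i](x_i)) for i, x_i in enumerate(x)
        )
        if log_score > best_log_score:
            best_class, best_log_score = k, log_score
    return best_class

# Same illustrative two-class, two-binary-feature setup as above.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [lambda v: 0.8 if v else 0.2, lambda v: 0.6 if v else 0.4],
    "ham":  [lambda v: 0.1 if v else 0.9, lambda v: 0.3 if v else 0.7],
}
print(classify_map(priors, likelihoods, (1, 0)))  # -> "spam"
</syntaxhighlight>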