== Probabilistic model ==

Abstractly, naive Bayes is a [[conditional probability]] model: it assigns probabilities <math>p(C_k \mid x_1, \ldots, x_n)</math> for each of the {{mvar|K}} possible outcomes or ''classes'' <math>C_k</math> given a problem instance to be classified, represented by a vector <math>\mathbf{x} = (x_1, \ldots, x_n)</math> encoding some {{mvar|n}} features (independent variables).<ref>{{cite book | last1 = Narasimha Murty | first1 = M. | last2 = Susheela Devi | first2 = V. | title = Pattern Recognition: An Algorithmic Approach | year = 2011 | publisher = Springer | isbn = 978-0857294944 }}</ref>

The problem with the above formulation is that if the number of features {{mvar|n}} is large or if a feature can take on a large number of values, then basing such a model on [[Conditional probability table|probability tables]] is infeasible. The model must therefore be reformulated to make it more tractable. Using [[Bayes' theorem]], the conditional probability can be decomposed as
<math display="block">p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} \,.</math>

In plain English, using [[Bayesian probability]] terminology, the above equation can be written as
<math display="block">\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \,.</math>

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on <math>C</math> and the values of the features <math>x_i</math> are given, so the denominator is effectively constant. The numerator is equivalent to the [[joint probability]] model
<math display="block">p(C_k, x_1, \ldots, x_n)\,,</math>
which can be rewritten as follows, using the [[Chain rule (probability)|chain rule]] for repeated applications of the definition of [[conditional probability]]:
<math display="block">\begin{align}
p(C_k, x_1, \ldots, x_n) & = p(x_1, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2, \ldots, x_n, C_k) \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \ p(x_3, \ldots, x_n, C_k) \\
& = \cdots \\
& = p(x_1 \mid x_2, \ldots, x_n, C_k) \ p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k) \ p(x_n \mid C_k) \ p(C_k) \,.
\end{align}</math>

Now the "naive" [[conditional independence]] assumptions come into play: assume that all features in <math>\mathbf{x}</math> are [[mutually independent]], conditional on the category <math>C_k</math>. Under this assumption,
<math display="block">p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)\,.</math>

Thus, the joint model can be expressed as
<math display="block">\begin{align}
p(C_k \mid x_1, \ldots, x_n) \varpropto\ & p(C_k, x_1, \ldots, x_n) \\
& = p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \cdots \\
& = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,,
\end{align}</math>
where <math>\varpropto</math> denotes [[Proportionality (mathematics)|proportionality]], since the denominator <math>p(\mathbf{x})</math> is omitted.

This means that under the above independence assumptions, the conditional distribution over the class variable <math>C</math> is
<math display="block">p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z} \ p(C_k) \prod_{i=1}^n p(x_i \mid C_k),</math>
where the evidence <math>Z = p(\mathbf{x}) = \sum_k p(C_k) \ p(\mathbf{x} \mid C_k)</math> is a scaling factor dependent only on <math>x_1, \ldots, x_n</math>, that is, a constant if the values of the feature variables are known.
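As an illustration of the formula above, the following is a minimal Python sketch that computes the posteriors <math>p(C_k \mid \mathbf{x})</math> from given class priors and per-feature conditional probabilities under the naive independence assumption. The names <code>priors</code>, <code>likelihoods</code> and <code>posterior</code> are purely illustrative and do not refer to any particular library.

<syntaxhighlight lang="python">
# Illustrative sketch: naive Bayes posteriors p(C_k | x) from class priors
# and per-feature conditional probabilities (all names are illustrative).

def posterior(priors, likelihoods, x):
    """priors[k] = p(C_k); likelihoods[k][i] is a function giving p(x_i | C_k)."""
    # Unnormalized scores: p(C_k) * prod_i p(x_i | C_k)
    scores = {}
    for k in priors:
        score = priors[k]
        for i, x_i in enumerate(x):
            score *= likelihoods[k][i](x_i)
        scores[k] = score
    # Evidence Z = sum_k p(C_k) p(x | C_k) normalizes the scores.
    Z = sum(scores.values())
    return {k: s / Z for k, s in scores.items()}

# Example: two classes and two binary features with given conditional probabilities.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [lambda v: 0.8 if v else 0.2, lambda v: 0.6 if v else 0.4],
    "ham":  [lambda v: 0.1 if v else 0.9, lambda v: 0.3 if v else 0.7],
}
print(posterior(priors, likelihoods, (1, 0)))  # posteriors sum to 1
</syntaxhighlight>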
Often, it is only necessary to [[Discriminative model|discriminate]] between classes. In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to an additive constant:
<math display="block">\ln p(C_k \mid x_1, \ldots, x_n) = \ln p(C_k) + \sum_{i=1}^n \ln p(x_i \mid C_k) \underbrace{- \ln Z}_{\text{irrelevant}}</math>
The scaling factor drops out, since discrimination subtracts it away:
<math display="block">\ln \frac{p(C_k \mid x_1, \ldots, x_n)}{p(C_l \mid x_1, \ldots, x_n)} = \left(\ln p(C_k) + \sum_{i=1}^n \ln p(x_i \mid C_k) \right) - \left(\ln p(C_l) + \sum_{i=1}^n \ln p(x_i \mid C_l) \right)</math>
There are two benefits to working with log-probabilities. One is that they admit an interpretation in information theory, where log-probabilities are quantities of information measured in [[Nat (unit)|nats]]. The other is that they avoid [[arithmetic underflow]].

=== Constructing a classifier from the probability model ===

The discussion so far has derived the independent feature model, that is, the naive Bayes [[probability model]]. The naive Bayes [[Statistical classification|classifier]] combines this model with a [[decision rule]]. One common rule is to pick the hypothesis that is most probable, so as to minimize the probability of misclassification; this is known as the ''[[maximum a posteriori|maximum a posteriori]]'' or ''MAP'' decision rule. The corresponding classifier, a [[Bayes classifier]], is the function that assigns a class label <math>\hat{y} = C_k</math> for some {{mvar|k}} as follows:
<math display="block">\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}} \ p(C_k) \prod_{i=1}^n p(x_i \mid C_k).</math>

[[File:ROC_curves.svg|thumb|[[Likelihood function]]s <math>p(\mathbf{x} \mid Y)</math>, [[confusion matrix]] and [[ROC curve]]. For the naive Bayes classifier, given that the a priori probabilities <math>p(Y)</math> are the same for all classes, the [[decision boundary]] (green line) would be placed at the point where the two probability densities intersect, due to {{nowrap|<math>p(Y \mid \mathbf{x}) = \frac{p(Y) \ p(\mathbf{x} \mid Y)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid Y)</math>.}}]]
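The following sketch combines the two points above: it implements the MAP decision rule in log space, so the evidence term <math>\ln Z</math> is simply dropped and underflow is avoided. It is a self-contained illustration under the same assumptions and illustrative names as the earlier sketch, not the implementation used by any particular library.

<syntaxhighlight lang="python">
import math

# Illustrative sketch of the MAP decision rule in log space (names are
# illustrative).  priors[k] = p(C_k); likelihoods[k][i] is a function
# returning p(x_i | C_k).

def classify_map(priors, likelihoods, x):
    """Return argmax_k [ ln p(C_k) + sum_i ln p(x_i | C_k) ]; ln Z is dropped."""
    best_class, best_log_score = None, float("-inf")
    for k in priors:
        # Summing log-probabilities avoids arithmetic underflow for large n.
        log_score = math.log(priors[k]) + sum(
            math.log(likelihoods[k][i](x_i)) for i, x_i in enumerate(x)
        )
        if log_score > best_log_score:
            best_class, best_log_score = k, log_score
    return best_class

# Same illustrative two-class, two-binary-feature setup as above.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [lambda v: 0.8 if v else 0.2, lambda v: 0.6 if v else 0.4],
    "ham":  [lambda v: 0.1 if v else 0.9, lambda v: 0.3 if v else 0.7],
}
print(classify_map(priors, likelihoods, (1, 0)))  # -> "spam"
</syntaxhighlight>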