== Discussion ==

Although its far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution. This helps alleviate problems stemming from the [[curse of dimensionality]], such as the need for data sets that scale exponentially with the number of features. While naive Bayes often fails to produce a good estimate of the correct class probabilities,<ref>{{cite conference |last1=Niculescu-Mizil |first1=Alexandru |first2=Rich |last2=Caruana |title=Predicting good probabilities with supervised learning |conference=ICML |year=2005 |url=http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |doi=10.1145/1102351.1102430 |access-date=2016-04-24 |archive-url=https://web.archive.org/web/20140311005243/http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf |archive-date=2014-03-11 |url-status=dead }}</ref> accurate probabilities are not a requirement for many applications. For example, the naive Bayes classifier makes the correct classification under the [[Maximum a posteriori estimation|MAP]] decision rule as long as the correct class is predicted as more probable than any other class. This holds regardless of whether the probability estimates are slightly or even grossly inaccurate.
In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.<ref name="rish">{{cite conference|last1=Rish|first1=Irina|year=2001|title=An empirical study of the naive Bayes classifier|url=http://www.research.ibm.com/people/r/rish/papers/RC22230.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.research.ibm.com/people/r/rish/papers/RC22230.pdf |archive-date=2022-10-09 |url-status=live|conference=IJCAI Workshop on Empirical Methods in AI}}</ref> Other reasons for the observed success of the naive Bayes classifier are discussed in the literature cited below.
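
The robustness of the MAP decision can be illustrated with a minimal sketch. It assumes a hypothetical two-class problem with equal priors and a single binary feature that is duplicated three times, so the naive independence assumption is violated as badly as possible. All numbers and names below are illustrative assumptions, not taken from the cited sources.

```python
def posterior(prior, likelihoods):
    """Normalise prior * likelihood over the classes."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

prior = [0.5, 0.5]   # assumed equal class priors
p_x1 = [0.6, 0.4]    # assumed p(x = 1 | class) for the single real feature
copies = 3           # the same feature is observed three times

# Correct posterior: the copies carry no extra information.
true_post = posterior(prior, p_x1)

# Naive Bayes treats the copies as independent and multiplies
# the same likelihood three times, which overstates the evidence.
naive_post = posterior(prior, [p ** copies for p in p_x1])

true_map = max(range(len(prior)), key=lambda k: true_post[k])
naive_map = max(range(len(prior)), key=lambda k: naive_post[k])

print(true_post[0], naive_post[0])  # the naive estimate is overconfident
print(true_map == naive_map)        # yet both pick the same class
```

In this setting the naive posterior for the first class is noticeably larger than the correct one, but the ranking of the classes, and hence the MAP decision, is unchanged. With unequal priors the two decisions can differ, so the sketch shows a favourable case, not a guarantee.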

===Relation to logistic regression===

In the case of discrete inputs (indicator or frequency features for discrete events), naive Bayes classifiers form a ''generative-discriminative'' pair with [[multinomial logistic regression]] classifiers: each naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood <math>p(C, \mathbf{x})</math>, while logistic regression fits the same probability model to optimize the conditional <math>p(C \mid \mathbf{x})</math>.<ref name="pair">{{cite conference |first1=Andrew Y. |last1=Ng |author-link1=Andrew Ng |first2=Michael I. |last2=Jordan |author-link2=Michael I. Jordan |title=On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes |conference=[[Conference on Neural Information Processing Systems|NIPS]] |volume=14 |year=2002 |url=http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes}}</ref>

More formally, we have the following:
{{Math theorem
| name = Theorem
| note = 
| math_statement = Naive Bayes classifiers on binary features are subsumed by logistic regression classifiers.
}}

{{Math proof|proof=  
Consider a generic multiclass classification problem with possible classes <math>Y\in \{1, \dots, n\}</math>. By Bayes' theorem, the (non-naive) Bayes classifier gives:
<math display="block">p(Y \mid X=x) = \text{softmax}(\{\ln p(Y = k) + \ln p(X=x \mid Y=k)\}_k)</math>

The naive Bayes classifier gives  
<math display="block">\text{softmax}\left(\left\{\ln p(Y = k) + \frac 12 \sum_i \left[(a^+_{i, k} - a^-_{i, k})x_i + (a^+_{i, k} + a^-_{i, k})\right]\right\}_k\right)</math>
where   
<math display="block">a^+_{i, s} = \ln p(X_i=+1 \mid Y=s);\quad a^-_{i, s} = \ln p(X_i=-1 \mid Y=s)</math>

Each class score is an affine function of <math>x</math>, so this is exactly a logistic regression classifier.}}
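
The identity used in the proof can be checked numerically. The following is a minimal sketch, assuming a hypothetical problem with three classes and two features taking values in <math>\{-1, +1\}</math>; the parameter values and the helper names (<code>nb_posterior</code>, <code>to_logistic</code>, <code>lr_posterior</code>) are illustrative assumptions.

```python
import itertools
import math

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

prior = [0.2, 0.5, 0.3]             # assumed p(Y = k)
# theta[k][i] is the assumed p(X_i = +1 | Y = k)
theta = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.5]]

def nb_posterior(x):
    """Naive Bayes posterior computed directly from the factorised likelihood."""
    scores = []
    for k, pk in enumerate(prior):
        s = math.log(pk)
        for i, xi in enumerate(x):
            s += math.log(theta[k][i] if xi == 1 else 1 - theta[k][i])
        scores.append(s)
    return softmax(scores)

def to_logistic(prior, theta):
    """Weights and biases of the equivalent softmax-of-affine classifier."""
    W, b = [], []
    for k, pk in enumerate(prior):
        a_plus = [math.log(t) for t in theta[k]]        # ln p(X_i = +1 | Y = k)
        a_minus = [math.log(1 - t) for t in theta[k]]   # ln p(X_i = -1 | Y = k)
        W.append([(ap - am) / 2 for ap, am in zip(a_plus, a_minus)])
        b.append(math.log(pk) + sum((ap + am) / 2 for ap, am in zip(a_plus, a_minus)))
    return W, b

def lr_posterior(x, W, b):
    return softmax([bk + sum(w * xi for w, xi in zip(wk, x)) for wk, bk in zip(W, b)])

W, b = to_logistic(prior, theta)
# Largest disagreement between the two posteriors over all four inputs.
max_gap = max(
    abs(p - q)
    for x in itertools.product([-1, 1], repeat=2)
    for p, q in zip(nb_posterior(x), lr_posterior(x, W, b))
)
print(max_gap)  # differs from zero only by floating-point rounding
```

The sketch only shows that every naive Bayes model of this kind corresponds to some logistic regression parameters. It does not fit a logistic regression, whose trained parameters would generally differ.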

The link between the two can be seen by observing that the decision function for naive Bayes (in the binary case) can be rewritten as "predict class <math>C_1</math> if the [[odds]] of <math>C_1</math> against <math>C_2</math>, <math>p(C_1 \mid \mathbf{x}) / p(C_2 \mid \mathbf{x})</math>, exceed 1". Expressing this in log-space gives:
<math display="block">
\log\frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log p(C_1 \mid \mathbf{x}) - \log p(C_2 \mid \mathbf{x}) > 0
</math>

The left-hand side of this inequality is the log-odds, or ''[[logit]]'', the quantity predicted by the linear model that underlies logistic regression. Since naive Bayes is also a linear model for the two "discrete" event models, its decision rule can be reparametrised as the linear inequality <math>b + \mathbf{w}^\top \mathbf{x} > 0</math>. Obtaining the probabilities is then a matter of applying the [[logistic function]] to <math>b + \mathbf{w}^\top \mathbf{x}</math>, or in the multiclass case, the [[softmax function]].
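
As a worked instance of this reparametrisation, assume the Bernoulli event model with features <math>x_i \in \{0, 1\}</math>, and write <math>\pi_c = p(C_c)</math> and <math>\theta_{ci} = p(x_i = 1 \mid C_c)</math>, with <math>0 < \theta_{ci} < 1</math>. This notation is introduced here for illustration only. Under the naive independence assumption, the log-odds expand as
<math display="block">
\log\frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log\frac{\pi_1}{\pi_2} + \sum_i \left[ x_i \log\frac{\theta_{1i}}{\theta_{2i}} + (1 - x_i) \log\frac{1 - \theta_{1i}}{1 - \theta_{2i}} \right]
</math>
Collecting the terms that multiply <math>x_i</math> and those that do not gives the linear form <math>b + \mathbf{w}^\top \mathbf{x}</math> with
<math display="block">
w_i = \log\frac{\theta_{1i}(1 - \theta_{2i})}{\theta_{2i}(1 - \theta_{1i})}, \qquad b = \log\frac{\pi_1}{\pi_2} + \sum_i \log\frac{1 - \theta_{1i}}{1 - \theta_{2i}}
</math>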

Discriminative classifiers have lower asymptotic error than generative ones; however, research by [[Andrew Ng|Ng]] and [[Michael I. Jordan|Jordan]] has shown that in some practical cases naive Bayes can outperform logistic regression because it reaches its asymptotic error faster.<ref name="pair"/>