===Relation to logistic regression===

In the case of discrete inputs (indicator or frequency features for discrete events), naive Bayes classifiers form a ''generative-discriminative'' pair with [[multinomial logistic regression]] classifiers: each naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood <math>p(C, \mathbf{x})</math>, while logistic regression fits the same probability model to optimize the conditional <math>p(C \mid \mathbf{x})</math>.<ref name="pair">{{cite conference |first1=Andrew Y. |last1=Ng |author-link1=Andrew Ng |first2=Michael I. |last2=Jordan |author-link2=Michael I. Jordan |title=On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes |conference=[[Conference on Neural Information Processing Systems|NIPS]] |volume=14 |year=2002 |url=http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes}}</ref>

More formally, we have the following:

{{Math theorem
| name = Theorem
| note =
| math_statement = Naive Bayes classifiers on binary features are subsumed by logistic regression classifiers.
}}

{{Math proof|proof=
Consider a generic multiclass classification problem with possible classes <math>Y \in \{1, \ldots, n\}</math> and binary features <math>X_i \in \{+1, -1\}</math>. By Bayes' theorem, the (non-naive) Bayes classifier gives
<math display="block">p(Y \mid X = x) = \text{softmax}\left(\{\ln p(Y = k) + \ln p(X = x \mid Y = k)\}_k\right)</math>
Define
<math display="block">a^+_{i, k} = \ln p(X_i = +1 \mid Y = k), \quad a^-_{i, k} = \ln p(X_i = -1 \mid Y = k).</math>
Since <math>x_i \in \{+1, -1\}</math>, each conditional log-probability is affine in <math>x_i</math>:
<math display="block">\ln p(X_i = x_i \mid Y = k) = \frac{1 + x_i}{2} a^+_{i, k} + \frac{1 - x_i}{2} a^-_{i, k} = \frac{1}{2}\left[(a^+_{i, k} - a^-_{i, k}) x_i + (a^+_{i, k} + a^-_{i, k})\right].</math>
Under the naive Bayes independence assumption the class-conditional likelihood factorizes over the features, so the naive Bayes classifier gives
<math display="block">p(Y \mid X = x) = \text{softmax}\left(\left\{\ln p(Y = k) + \frac{1}{2} \sum_i \left[(a^+_{i, k} - a^-_{i, k}) x_i + (a^+_{i, k} + a^-_{i, k})\right]\right\}_k\right),</math>
which is a softmax applied to a linear function of <math>x</math>, i.e. exactly a (multinomial) logistic regression classifier.
}}

The link between the two can be seen by observing that the decision function for naive Bayes (in the binary case) can be rewritten as "predict class <math>C_1</math> if the [[odds]] of <math>p(C_1 \mid \mathbf{x})</math> exceed those of <math>p(C_2 \mid \mathbf{x})</math>". Expressing this in log-space gives:
<math display="block">\log\frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log p(C_1 \mid \mathbf{x}) - \log p(C_2 \mid \mathbf{x}) > 0</math>
The left-hand side of this equation is the log-odds, or ''[[logit]]'', the quantity predicted by the linear model that underlies logistic regression. Since naive Bayes is also a linear model for the two "discrete" event models, it can be reparametrised as a linear function <math>b + \mathbf{w}^\top \mathbf{x} > 0</math>. Obtaining the probabilities is then a matter of applying the [[logistic function]] to <math>b + \mathbf{w}^\top \mathbf{x}</math>, or in the multiclass case, the [[softmax function]].

Discriminative classifiers such as logistic regression have lower asymptotic error than generative ones such as naive Bayes; however, research by [[Andrew Ng|Ng]] and [[Michael I. Jordan|Jordan]] has shown that in some practical cases naive Bayes can outperform logistic regression because it approaches its (higher) asymptotic error faster, i.e. with fewer training examples.<ref name="pair"/>
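The reparametrisation above can be made concrete with a short numerical check. The following is a minimal sketch (in Python with NumPy, not part of any particular library) of a two-class naive Bayes model over features <math>x_i \in \{+1, -1\}</math>, reduced to the linear form <math>b + \mathbf{w}^\top \mathbf{x}</math> and passed through the logistic function; the parameter values and variable names are illustrative assumptions chosen only to demonstrate that the two formulations agree.

<syntaxhighlight lang="python">
# A minimal sketch: a two-class naive Bayes model over binary features
# x_i in {+1, -1}, reparametrised as the linear model b + w.x and passed
# through the logistic (sigmoid) function.  All parameter values below
# are illustrative, not estimated from any dataset.
import numpy as np

rng = np.random.default_rng(0)
n_features = 5

# Class priors p(C_k) and conditional probabilities p(X_i = +1 | C_k);
# p(X_i = -1 | C_k) = 1 - p(X_i = +1 | C_k).
prior = np.array([0.6, 0.4])
p_plus = rng.uniform(0.1, 0.9, size=(2, n_features))

log_prior = np.log(prior)
a_plus = np.log(p_plus)        # a^+_{i,k} = ln p(X_i = +1 | Y = k)
a_minus = np.log(1 - p_plus)   # a^-_{i,k} = ln p(X_i = -1 | Y = k)

def naive_bayes_posterior(x):
    """Posterior p(C_1 | x) computed directly from the generative model."""
    # ln p(X_i = x_i | Y = k) for each feature and class, summed over features
    log_lik = np.where(x == 1, a_plus, a_minus).sum(axis=1)
    joint = log_prior + log_lik
    joint -= joint.max()                        # numerical stability
    post = np.exp(joint) / np.exp(joint).sum()  # normalize over classes
    return post[0]

# Reparametrisation as b + w.x: w_i is the coefficient of x_i in the
# log-odds of class 1 versus class 2, b collects the constant terms.
w = 0.5 * ((a_plus[0] - a_minus[0]) - (a_plus[1] - a_minus[1]))
b = (log_prior[0] - log_prior[1]
     + 0.5 * ((a_plus[0] + a_minus[0]) - (a_plus[1] + a_minus[1])).sum())

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# The two formulations agree on any binary input vector.
x = rng.choice([-1, 1], size=n_features)
assert np.isclose(naive_bayes_posterior(x), logistic(b + w @ x))
</syntaxhighlight>

The difference between the two methods lies in how the linear parameters are obtained: in the sketch above <math>\mathbf{w}</math> and <math>b</math> are derived from the generative model's priors and class-conditional probabilities (maximizing the joint likelihood), whereas logistic regression would fit <math>\mathbf{w}</math> and <math>b</math> directly by maximizing the conditional likelihood.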