== Parameter estimation and event models ==

A class's prior may be calculated by assuming equiprobable classes, i.e., <math>p(C_k) = \frac{1}{K}</math>, or by calculating an estimate for the class probability from the training set:

<math display="block">\text{prior for a given class} = \frac{\text{no. of samples in that class}}{\text{total no. of samples}} \,</math>

To estimate the parameters for a feature's distribution, one must assume a distribution or generate [[nonparametric]] models for the features from the training set.<ref name="john95">{{cite conference |first1=George H. |last1=John |first2=Pat |last2=Langley |year=1995 |title=Estimating Continuous Distributions in Bayesian Classifiers |conference=Proc. Eleventh Conf. on Uncertainty in Artificial Intelligence |pages=338–345 |publisher=Morgan Kaufmann |arxiv=1302.4964 |url=https://dl.acm.org/doi/10.5555/2074158.2074196 }}</ref>

The assumptions on distributions of features are called the "event model" of the naive Bayes classifier. For discrete features like the ones encountered in document classification (including spam filtering), [[Multinomial distribution|multinomial]] and [[Bernoulli distribution|Bernoulli]] distributions are popular. These assumptions lead to two distinct models, which are often confused.<ref name="mccallum">{{cite conference |last1=McCallum |first1=Andrew |first2=Kamal |last2=Nigam |title=A comparison of event models for Naive Bayes text classification |conference=AAAI-98 workshop on learning for text categorization |volume=752 |year=1998 |url=http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf |archive-date=2022-10-09 |url-status=live}}</ref><ref>{{cite conference |last1=Metsis |first1=Vangelis |first2=Ion |last2=Androutsopoulos |first3=Georgios |last3=Paliouras |title=Spam filtering with Naive Bayes—which Naive Bayes? |conference=Third conference on email and anti-spam (CEAS) |volume=17 |year=2006 |url=https://www.researchgate.net/publication/221650814}}</ref>

===Gaussian naive Bayes===

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a [[Normal distribution|normal]] (or Gaussian) distribution. For example, suppose the training data contains a continuous attribute <math>x</math>. The data is first segmented by class, and then the mean and [[Variance#Estimating the variance|variance]] of <math>x</math> are computed in each class. Let <math>\mu_k</math> be the mean of the values in <math>x</math> associated with class <math>C_k</math>, and let <math>\sigma^2_k</math> be the [[Bessel's correction|Bessel-corrected variance]] of the values in <math>x</math> associated with class <math>C_k</math>. Suppose one has collected some observation value <math>v</math>. Then the probability ''density'' of <math>v</math> given a class <math>C_k</math>, i.e., <math>p(x=v \mid C_k)</math>, can be computed by plugging <math>v</math> into the equation for a [[normal distribution]] parameterized by <math>\mu_k</math> and <math>\sigma^2_k</math>. Formally,

<math display="block"> p(x=v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma^2_k}}\,e^{ -\frac{(v-\mu_k)^2}{2\sigma^2_k} } </math>
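As an illustration of the prior estimate and the Gaussian density above, the following Python sketch scores each class for a single continuous observation. The function and variable names are illustrative only (not taken from any particular library), and each class is assumed to contain at least two training samples so that the Bessel-corrected variance is defined.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def gaussian_nb_scores(samples, labels, v):
    """Score each class for a single continuous observation v.

    samples: list of observed feature values; labels: list of class labels,
    one per sample. Returns unnormalised scores p(C_k) * p(x = v | C_k);
    each class is assumed to contain at least two samples.
    """
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)

    scores = {}
    total = len(samples)
    for c, xs in by_class.items():
        prior = len(xs) / total                                # samples in class / total samples
        mu = sum(xs) / len(xs)                                 # class mean
        var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)   # Bessel-corrected variance
        density = math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[c] = prior * density
    return scores
</syntaxhighlight>

Normalising the returned scores over all classes yields the posterior probabilities; the class with the largest score is the maximum a posteriori prediction.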
Another common technique for handling continuous values is to use binning to [[Discretization of continuous features|discretize]] the feature values and obtain a new set of Bernoulli-distributed features. Some literature suggests that this is required in order to use naive Bayes, but it is not: the discretization may [[Discretization error|throw away discriminative information]].<ref name="idiots"/>

Sometimes the distribution of class-conditional marginal densities is far from normal. In these cases, [[kernel density estimation]] can be used for a more realistic estimate of the marginal densities of each class. This method, which was introduced by John and Langley,<ref name="john95"/> can boost the accuracy of the classifier considerably.<ref name="piryonesi2020">{{Cite journal |last1=Piryonesi |first1=S. Madeh |last2=El-Diraby |first2=Tamer E. |date=2020-06-01 |title=Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems |journal=Journal of Transportation Engineering, Part B: Pavements |volume=146 |issue=2 |pages=04020022 |doi=10.1061/JPEODX.0000175 |s2cid=216485629}}</ref><ref name="hastie01">{{Cite book |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome H. |title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction |date=2001 |publisher=Springer |location=New York |isbn=0-387-95284-5 |oclc=46809224}}</ref>

===Multinomial naive Bayes===

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a [[Multinomial distribution|multinomial]] <math>(p_1, \dots, p_n)</math>, where <math>p_i</math> is the probability that event {{mvar|i}} occurs (or {{mvar|K}} such multinomials in the multiclass case). A feature vector <math>\mathbf{x} = (x_1, \dots, x_n)</math> is then a [[histogram]], with <math>x_i</math> counting the number of times event {{mvar|i}} was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the [[bag of words]] assumption).<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An Introduction to Statistical Learning: with Applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=157 |edition=Second |doi=10.1007/978-1-0716-1418-1 |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=10 November 2024}}</ref> The likelihood of observing a histogram {{math|'''x'''}} is given by:

<math display="block"> p(\mathbf{x} \mid C_k) = \frac{(\sum_{i=1}^n x_i)!}{\prod_{i=1}^n x_i !} \prod_{i=1}^n {p_{ki}}^{x_i} </math>

where <math>p_{ki} := p(i \mid C_k)</math>.
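For illustration, the likelihood above can be evaluated directly from a count vector. The following Python sketch assumes the class-conditional event probabilities <math>p_{ki}</math> have already been estimated; it is not tied to any particular library, and the names are illustrative.

<syntaxhighlight lang="python">
import math

def multinomial_likelihood(x, p_k):
    """p(x | C_k) for a count vector x under class-conditional event probabilities p_k.

    x: list of event counts (e.g. word counts); p_k: list of probabilities
    p_ki for class C_k, summing to 1.
    """
    coeff = math.factorial(sum(x))
    for x_i in x:
        coeff //= math.factorial(x_i)     # multinomial coefficient (sum_i x_i)! / prod_i x_i!
    likelihood = float(coeff)
    for x_i, p_ki in zip(x, p_k):
        likelihood *= p_ki ** x_i         # prod_i p_ki^{x_i}
    return likelihood
</syntaxhighlight>

Only the product <math>\prod_i p_{ki}^{x_i}</math> varies with the class; the multinomial coefficient depends only on the counts and can therefore be dropped when comparing classes.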
The multinomial naive Bayes classifier becomes a [[linear classifier]] when expressed in log-space:<ref name="rennie">{{cite conference |last1=Rennie |first1=J. |last2=Shih |first2=L. |last3=Teevan |first3=J. |last4=Karger |first4=D. |title=Tackling the poor assumptions of naive Bayes classifiers |conference=ICML |year=2003 |url=http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-date=2022-10-09 |url-status=live}}</ref>

<math display="block"> \begin{align} \log p(C_k \mid \mathbf{x}) & \varpropto \log \left( p(C_k) \prod_{i=1}^n {p_{ki}}^{x_i} \right) \\ & = \log p(C_k) + \sum_{i=1}^n x_i \cdot \log p_{ki} \\ & = b + \mathbf{w}_k^\top \mathbf{x} \end{align} </math>

where <math>b = \log p(C_k)</math> and <math>w_{ki} = \log p_{ki}</math>. Estimating the parameters in log space is advantageous because multiplying a large number of small probabilities can lead to significant rounding error; working with sums of logarithms reduces this effect.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a [[pseudocount]], in all probability estimates so that no probability is ever set to exactly zero. This way of [[regularization (mathematics)|regularizing]] naive Bayes is called [[Laplace smoothing]] when the pseudocount is one, and [[Lidstone smoothing]] in the general case.<!-- TODO: cite Jurafsky and Martin for this -->

Rennie ''et al.'' discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of [[tf–idf]] weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with [[support vector machine]]s.<ref name="rennie"/>

===Bernoulli naive Bayes===

In the multivariate [[Bernoulli distribution|Bernoulli]] event model, features are independent [[Boolean data type|Boolean variables]] ([[binary data|binary variables]]) describing inputs. Like the multinomial model, this model is popular for document classification tasks,<ref name="mccallum"/> where binary term-occurrence features are used rather than term frequencies. If <math>x_i</math> is a Boolean expressing the occurrence or absence of the {{mvar|i}}'th term from the vocabulary, then the likelihood of a document given a class <math>C_k</math> is given by:<ref name="mccallum"/>

<math display="block"> p(\mathbf{x} \mid C_k) = \prod_{i=1}^n p_{ki}^{x_i} (1 - p_{ki})^{(1-x_i)} </math>

where <math>p_{ki}</math> is the probability of class <math>C_k</math> generating the term <math>x_i</math>. This event model is especially popular for classifying short texts. It has the benefit of explicitly modelling the absence of terms. Note that a naive Bayes classifier with a Bernoulli event model is not the same as a multinomial NB classifier with frequency counts truncated to one.
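The following sketch evaluates the Bernoulli likelihood above in log form for a binary term-occurrence vector. The names are illustrative, and the per-class term probabilities are assumed to have been estimated with a pseudocount so that they lie strictly between 0 and 1.

<syntaxhighlight lang="python">
import math

def bernoulli_log_likelihood(x, p_k):
    """log p(x | C_k) for a binary occurrence vector x.

    x: list of 0/1 indicators, one per vocabulary term; p_k: list of
    probabilities p_ki that class C_k generates term i (assumed to lie
    strictly between 0 and 1, e.g. after pseudocount smoothing).
    """
    log_lik = 0.0
    for x_i, p_ki in zip(x, p_k):
        # Unlike the multinomial model, absent terms (x_i = 0) also
        # contribute, via the factor (1 - p_ki).
        log_lik += x_i * math.log(p_ki) + (1 - x_i) * math.log(1.0 - p_ki)
    return log_lik
</syntaxhighlight>

Adding <math>\log p(C_k)</math> to this value and taking the class with the largest result gives the Bernoulli naive Bayes prediction.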
===Semi-supervised parameter estimation===

Given a way to train a naive Bayes classifier from labeled data, it is possible to construct a [[semi-supervised learning|semi-supervised]] training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop (a sketch of this loop is given below):<ref name="em"/>

# Given a collection <math>D = L \uplus U</math> of labeled samples {{mvar|L}} and unlabeled samples {{mvar|U}}, start by training a naive Bayes classifier on {{mvar|L}}.
# Until convergence, do:
## Predict class probabilities <math>P(C \mid x)</math> for all examples {{mvar|x}} in <math>D</math>.
## Re-train the model based on the ''probabilities'' (not the labels) predicted in the previous step.

Convergence is determined based on improvement to the model likelihood <math>P(D \mid \theta)</math>, where <math>\theta</math> denotes the parameters of the naive Bayes model.

This training algorithm is an instance of the more general [[expectation–maximization algorithm]] (EM): the prediction step inside the loop is the ''E''-step of EM, while the re-training of naive Bayes is the ''M''-step. The algorithm is formally justified by the assumption that the data are generated by a [[mixture model]], and the components of this mixture model are exactly the classes of the classification problem.<ref name="em">{{cite journal |first1=Kamal |last1=Nigam |first2=Andrew |last2=McCallum |first3=Sebastian |last3=Thrun |first4=Tom |last4=Mitchell |title=Learning to classify text from labeled and unlabeled documents using EM |journal=[[Machine Learning (journal)|Machine Learning]] |volume=39 |issue=2/3 |pages=103–134 |year=2000 |url=http://www.kamalnigam.com/papers/emcat-aaai98.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.kamalnigam.com/papers/emcat-aaai98.pdf |archive-date=2022-10-09 |url-status=live |doi=10.1023/A:1007692713085 |s2cid=686980 |doi-access=free}}</ref>
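The loop above can be sketched as follows. Here <code>train_nb</code>, <code>predict_proba</code>, and <code>log_likelihood</code> stand for the supervised training, prediction, and likelihood routines of whichever naive Bayes implementation is being wrapped; they are assumptions of this sketch rather than a fixed API.

<syntaxhighlight lang="python">
def semi_supervised_nb(L, U, train_nb, predict_proba, log_likelihood, tol=1e-4):
    """EM-style semi-supervised loop around a supervised naive Bayes learner.

    L: list of (x, label) pairs; U: list of unlabeled examples x.
    train_nb(triples) fits a model from (x, class, weight) triples;
    predict_proba(model, x) returns a dict {class: P(class | x)};
    log_likelihood(model, D) returns log P(D | theta).
    """
    D = [x for x, _ in L] + list(U)

    # Initial step: train on the labeled data only (weight 1 on the true class).
    model = train_nb([(x, c, 1.0) for x, c in L])
    prev_ll = log_likelihood(model, D)

    while True:
        # E-step: predict class probabilities for every example in D.
        weighted = [(x, c, p)
                    for x in D
                    for c, p in predict_proba(model, x).items()]
        # M-step: re-train on the predicted probabilities, not hard labels.
        model = train_nb(weighted)
        ll = log_likelihood(model, D)
        if ll - prev_ll < tol:        # stop once the likelihood stops improving
            break
        prev_ll = ll
    return model
</syntaxhighlight>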