== Parameter estimation and event models ==

A class's prior may be calculated by assuming equiprobable classes, i.e., <math>p(C_k) = \frac{1}{K}</math>, or by calculating an estimate for the class probability from the training set:

<math display="block">\text{prior for a given class} = \frac{\text{no. of samples in that class}}{\text{total no. of samples}} \,</math>

To estimate the parameters for a feature's distribution, one must assume a distribution or generate [[nonparametric]] models for the features from the training set.<ref name="john95">{{cite conference |first1=George H. |last1=John |first2=Pat |last2=Langley |year=1995 |title=Estimating Continuous Distributions in Bayesian Classifiers |conference=Proc. Eleventh Conf. on Uncertainty in Artificial Intelligence |pages=338–345 |publisher=Morgan Kaufmann |arxiv=1302.4964 |url=https://dl.acm.org/doi/10.5555/2074158.2074196 }}</ref>

The assumptions on distributions of features are called the "event model" of the naive Bayes classifier. For discrete features like the ones encountered in document classification (including spam filtering), [[Multinomial distribution|multinomial]] and [[Bernoulli distribution|Bernoulli]] distributions are popular. These assumptions lead to two distinct models, which are often confused.<ref name="mccallum">{{cite conference |last1=McCallum |first1=Andrew |first2=Kamal |last2=Nigam |title=A comparison of event models for Naive Bayes text classification |conference=AAAI-98 workshop on learning for text categorization |volume=752 |year=1998 |url=http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf |archive-date=2022-10-09 |url-status=live}}</ref><ref>{{cite conference |last1=Metsis |first1=Vangelis |first2=Ion |last2=Androutsopoulos |first3=Georgios |last3=Paliouras |title=Spam filtering with Naive Bayes—which Naive Bayes? |conference=Third conference on email and anti-spam (CEAS) |volume=17 |year=2006 |url=https://www.researchgate.net/publication/221650814}}</ref>

===Gaussian naive Bayes===

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a [[Normal distribution|normal]] (or Gaussian) distribution. For example, suppose the training data contains a continuous attribute <math>x</math>. The data is first segmented by class, and then the mean and [[Variance#Estimating the variance|variance]] of <math>x</math> are computed in each class. Let <math>\mu_k</math> be the mean of the values in <math>x</math> associated with class <math>C_k</math>, and let <math>\sigma^2_k</math> be the [[Bessel's correction|Bessel-corrected variance]] of the values in <math>x</math> associated with class <math>C_k</math>. Suppose one has collected some observation value <math>v</math>. Then the probability ''density'' of <math>v</math> given a class <math>C_k</math>, i.e., <math>p(x=v \mid C_k)</math>, can be computed by plugging <math>v</math> into the equation for a [[normal distribution]] parameterized by <math>\mu_k</math> and <math>\sigma^2_k</math>. Formally,

<math display="block"> p(x=v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma^2_k}}\,e^{ -\frac{(v-\mu_k)^2}{2\sigma^2_k} } </math>
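As an illustration of the prior estimate and the Gaussian density above, the following Python sketch scores each class for a single continuous observation. The function and variable names are illustrative only (not taken from any particular library), and each class is assumed to contain at least two training samples so that the Bessel-corrected variance is defined.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def gaussian_nb_scores(samples, labels, v):
    """Score each class for a single continuous observation v.

    samples: list of observed feature values; labels: list of class labels,
    one per sample. Returns unnormalised scores p(C_k) * p(x = v | C_k);
    each class is assumed to contain at least two samples.
    """
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)

    scores = {}
    total = len(samples)
    for c, xs in by_class.items():
        prior = len(xs) / total                                # samples in class / total samples
        mu = sum(xs) / len(xs)                                 # class mean
        var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)   # Bessel-corrected variance
        density = math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[c] = prior * density
    return scores
</syntaxhighlight>

Normalising the returned scores over all classes yields the posterior probabilities; the class with the largest score is the maximum a posteriori prediction.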
Another common technique for handling continuous values is to use binning to [[Discretization of continuous features|discretize]] the feature values and obtain a new set of Bernoulli-distributed features. Some literature suggests that this is required in order to use naive Bayes, but it is not: the discretization may [[Discretization error|throw away discriminative information]].<ref name="idiots"/>

Sometimes the distribution of class-conditional marginal densities is far from normal. In these cases, [[kernel density estimation]] can be used for a more realistic estimate of the marginal densities of each class. This method, which was introduced by John and Langley,<ref name="john95"/> can boost the accuracy of the classifier considerably.<ref name="piryonesi2020">{{Cite journal |last1=Piryonesi |first1=S. Madeh |last2=El-Diraby |first2=Tamer E. |date=2020-06-01 |title=Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems |journal=Journal of Transportation Engineering, Part B: Pavements |volume=146 |issue=2 |pages=04020022 |doi=10.1061/JPEODX.0000175 |s2cid=216485629}}</ref><ref name="hastie01">{{Cite book |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome H. |title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction |date=2001 |publisher=Springer |location=New York |isbn=0-387-95284-5 |oclc=46809224}}</ref>

===Multinomial naive Bayes===

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a [[Multinomial distribution|multinomial]] <math>(p_1, \dots, p_n)</math>, where <math>p_i</math> is the probability that event {{mvar|i}} occurs (or {{mvar|K}} such multinomials in the multiclass case). A feature vector <math>\mathbf{x} = (x_1, \dots, x_n)</math> is then a [[histogram]], with <math>x_i</math> counting the number of times event {{mvar|i}} was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the [[bag of words]] assumption).<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An Introduction to Statistical Learning: with Applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=157 |edition=Second |doi=10.1007/978-1-0716-1418-1 |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=10 November 2024}}</ref> The likelihood of observing a histogram {{math|'''x'''}} is given by:

<math display="block"> p(\mathbf{x} \mid C_k) = \frac{(\sum_{i=1}^n x_i)!}{\prod_{i=1}^n x_i !} \prod_{i=1}^n {p_{ki}}^{x_i} </math>

where <math>p_{ki} := p(i \mid C_k)</math>.
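For illustration, the likelihood above can be evaluated directly from a count vector. The following Python sketch assumes the class-conditional event probabilities <math>p_{ki}</math> have already been estimated; it is not tied to any particular library, and the names are illustrative.

<syntaxhighlight lang="python">
import math

def multinomial_likelihood(x, p_k):
    """p(x | C_k) for a count vector x under class-conditional event probabilities p_k.

    x: list of event counts (e.g. word counts); p_k: list of probabilities
    p_ki for class C_k, summing to 1.
    """
    coeff = math.factorial(sum(x))
    for x_i in x:
        coeff //= math.factorial(x_i)     # multinomial coefficient (sum_i x_i)! / prod_i x_i!
    likelihood = float(coeff)
    for x_i, p_ki in zip(x, p_k):
        likelihood *= p_ki ** x_i         # prod_i p_ki^{x_i}
    return likelihood
</syntaxhighlight>

Only the product <math>\prod_i p_{ki}^{x_i}</math> varies with the class; the multinomial coefficient depends only on the counts and can therefore be dropped when comparing classes.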
The multinomial naive Bayes classifier becomes a [[linear classifier]] when expressed in log-space:<ref name="rennie">{{cite conference |last1=Rennie |first1=J. |last2=Shih |first2=L. |last3=Teevan |first3=J. |last4=Karger |first4=D. |title=Tackling the poor assumptions of naive Bayes classifiers |conference=ICML |year=2003 |url=http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://people.csail.mit.edu/~jrennie/papers/icml03-nb.pdf |archive-date=2022-10-09 |url-status=live}}</ref>

<math display="block"> \begin{align} \log p(C_k \mid \mathbf{x}) & \varpropto \log \left( p(C_k) \prod_{i=1}^n {p_{ki}}^{x_i} \right) \\ & = \log p(C_k) + \sum_{i=1}^n x_i \cdot \log p_{ki} \\ & = b + \mathbf{w}_k^\top \mathbf{x} \end{align} </math>

where <math>b = \log p(C_k)</math> and <math>w_{ki} = \log p_{ki}</math>. Estimating the parameters in log space is advantageous because multiplying a large number of small probabilities can lead to significant rounding error; working with sums of logarithms reduces this effect.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a [[pseudocount]], in all probability estimates so that no probability is ever set to exactly zero. This way of [[regularization (mathematics)|regularizing]] naive Bayes is called [[Laplace smoothing]] when the pseudocount is one, and [[Lidstone smoothing]] in the general case.<!-- TODO: cite Jurafsky and Martin for this -->

Rennie ''et al.'' discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of [[tf–idf]] weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with [[support vector machine]]s.<ref name="rennie"/>

===Bernoulli naive Bayes===

In the multivariate [[Bernoulli distribution|Bernoulli]] event model, features are independent [[Boolean data type|Boolean variables]] ([[binary data|binary variables]]) describing inputs. Like the multinomial model, this model is popular for document classification tasks,<ref name="mccallum"/> where binary term-occurrence features are used rather than term frequencies. If <math>x_i</math> is a Boolean expressing the occurrence or absence of the {{mvar|i}}'th term from the vocabulary, then the likelihood of a document given a class <math>C_k</math> is given by:<ref name="mccallum"/>

<math display="block"> p(\mathbf{x} \mid C_k) = \prod_{i=1}^n p_{ki}^{x_i} (1 - p_{ki})^{(1-x_i)} </math>

where <math>p_{ki}</math> is the probability of class <math>C_k</math> generating the term <math>x_i</math>. This event model is especially popular for classifying short texts. It has the benefit of explicitly modelling the absence of terms. Note that a naive Bayes classifier with a Bernoulli event model is not the same as a multinomial NB classifier with frequency counts truncated to one.
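The following sketch evaluates the Bernoulli likelihood above in log form for a binary term-occurrence vector. The names are illustrative, and the per-class term probabilities are assumed to have been estimated with a pseudocount so that they lie strictly between 0 and 1.

<syntaxhighlight lang="python">
import math

def bernoulli_log_likelihood(x, p_k):
    """log p(x | C_k) for a binary occurrence vector x.

    x: list of 0/1 indicators, one per vocabulary term; p_k: list of
    probabilities p_ki that class C_k generates term i (assumed to lie
    strictly between 0 and 1, e.g. after pseudocount smoothing).
    """
    log_lik = 0.0
    for x_i, p_ki in zip(x, p_k):
        # Unlike the multinomial model, absent terms (x_i = 0) also
        # contribute, via the factor (1 - p_ki).
        log_lik += x_i * math.log(p_ki) + (1 - x_i) * math.log(1.0 - p_ki)
    return log_lik
</syntaxhighlight>

Adding <math>\log p(C_k)</math> to this value and taking the class with the largest result gives the Bernoulli naive Bayes prediction.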
===Semi-supervised parameter estimation===

Given a way to train a naive Bayes classifier from labeled data, it is possible to construct a [[semi-supervised learning|semi-supervised]] training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop (a sketch of this loop is given below):<ref name="em"/>

# Given a collection <math>D = L \uplus U</math> of labeled samples {{mvar|L}} and unlabeled samples {{mvar|U}}, start by training a naive Bayes classifier on {{mvar|L}}.
# Until convergence, do:
## Predict class probabilities <math>P(C \mid x)</math> for all examples {{mvar|x}} in <math>D</math>.
## Re-train the model based on the ''probabilities'' (not the labels) predicted in the previous step.

Convergence is determined based on improvement to the model likelihood <math>P(D \mid \theta)</math>, where <math>\theta</math> denotes the parameters of the naive Bayes model.

This training algorithm is an instance of the more general [[expectation–maximization algorithm]] (EM): the prediction step inside the loop is the ''E''-step of EM, while the re-training of naive Bayes is the ''M''-step. The algorithm is formally justified by the assumption that the data are generated by a [[mixture model]], and the components of this mixture model are exactly the classes of the classification problem.<ref name="em">{{cite journal |first1=Kamal |last1=Nigam |first2=Andrew |last2=McCallum |first3=Sebastian |last3=Thrun |first4=Tom |last4=Mitchell |title=Learning to classify text from labeled and unlabeled documents using EM |journal=[[Machine Learning (journal)|Machine Learning]] |volume=39 |issue=2/3 |pages=103–134 |year=2000 |url=http://www.kamalnigam.com/papers/emcat-aaai98.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.kamalnigam.com/papers/emcat-aaai98.pdf |archive-date=2022-10-09 |url-status=live |doi=10.1023/A:1007692713085 |s2cid=686980 |doi-access=free}}</ref>
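The loop above can be sketched as follows. Here <code>train_nb</code>, <code>predict_proba</code>, and <code>log_likelihood</code> stand for the supervised training, prediction, and likelihood routines of whichever naive Bayes implementation is being wrapped; they are assumptions of this sketch rather than a fixed API.

<syntaxhighlight lang="python">
def semi_supervised_nb(L, U, train_nb, predict_proba, log_likelihood, tol=1e-4):
    """EM-style semi-supervised loop around a supervised naive Bayes learner.

    L: list of (x, label) pairs; U: list of unlabeled examples x.
    train_nb(triples) fits a model from (x, class, weight) triples;
    predict_proba(model, x) returns a dict {class: P(class | x)};
    log_likelihood(model, D) returns log P(D | theta).
    """
    D = [x for x, _ in L] + list(U)

    # Initial step: train on the labeled data only (weight 1 on the true class).
    model = train_nb([(x, c, 1.0) for x, c in L])
    prev_ll = log_likelihood(model, D)

    while True:
        # E-step: predict class probabilities for every example in D.
        weighted = [(x, c, p)
                    for x in D
                    for c, p in predict_proba(model, x).items()]
        # M-step: re-train on the predicted probabilities, not hard labels.
        model = train_nb(weighted)
        ll = log_likelihood(model, D)
        if ll - prev_ll < tol:        # stop once the likelihood stops improving
            break
        prev_ll = ll
    return model
</syntaxhighlight>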