Naive Bayes classifier
===Document classification===
Here is a worked example of naive Bayesian classification applied to the [[document classification]] problem. Consider the problem of classifying documents by their content, for example into [[spamming|spam]] and non-spam [[e-mail]]s. Imagine that documents are drawn from a number of classes of documents which can be modeled as sets of words, where the (independent) probability that the ''i''-th word of a given document occurs in a document from class ''C'' can be written as
<math display="block">p(w_i \mid C)\,</math>
(For this treatment, things are further simplified by assuming that words are randomly distributed in the document; that is, words do not depend on the length of the document, on their position within the document relative to other words, or on other document context.)

Then the probability that a given document ''D'' contains all of the words <math>w_i</math>, given a class ''C'', is
<math display="block">p(D\mid C) = \prod_i p(w_i \mid C)\,</math>

The question to be answered is: "what is the probability that a given document ''D'' belongs to a given class ''C''?" In other words, what is <math>p(C \mid D)\,</math>?

Now [[Conditional probability|by definition]]
<math display="block">p(D\mid C)={p(D\cap C)\over p(C)}</math>
and
<math display="block">p(C \mid D) = {p(D\cap C)\over p(D)}</math>

Bayes' theorem combines these, by eliminating <math>p(D\cap C)</math>, into a statement of probability in terms of [[likelihood]]:
<math display="block">p(C\mid D) = \frac{p(C)\,p(D\mid C)}{p(D)}</math>

Assume for the moment that there are only two mutually exclusive classes, ''S'' and ¬''S'' (e.g. spam and not spam), such that every element (email) is in either one or the other. Then
<math display="block">p(D\mid S)=\prod_i p(w_i \mid S)\,</math>
and
<math display="block">p(D\mid\neg S)=\prod_i p(w_i\mid\neg S)\,</math>

Using the Bayesian result above, one can write:
<math display="block">p(S\mid D)={p(S)\over p(D)}\,\prod_i p(w_i \mid S)</math>
<math display="block">p(\neg S\mid D)={p(\neg S)\over p(D)}\,\prod_i p(w_i \mid\neg S)</math>

Dividing one by the other cancels the common factor <math>p(D)</math> and gives:
<math display="block">{p(S\mid D)\over p(\neg S\mid D)}={p(S)\,\prod_i p(w_i \mid S)\over p(\neg S)\,\prod_i p(w_i \mid\neg S)}</math>
which can be refactored as:
<math display="block">{p(S\mid D)\over p(\neg S\mid D)}={p(S)\over p(\neg S)}\,\prod_i {p(w_i \mid S)\over p(w_i \mid\neg S)}</math>

Thus, the probability ratio p(''S'' | ''D'') / p(¬''S'' | ''D'') can be expressed in terms of a series of [[likelihood function|likelihood ratios]]. Taking the [[logarithm]] of all these ratios, one obtains:
<math display="block">\ln{p(S\mid D)\over p(\neg S\mid D)}=\ln{p(S)\over p(\neg S)}+\sum_i \ln{p(w_i\mid S)\over p(w_i\mid\neg S)}</math>

The actual probability p(''S'' | ''D'') can be recovered from ln (p(''S'' | ''D'') / p(¬''S'' | ''D'')) using the observation that p(''S'' | ''D'') + p(¬''S'' | ''D'') = 1. (The use of "[[log-likelihood ratio]]s" is common in statistics. In the case of two mutually exclusive alternatives, as in this example, the conversion of a log-likelihood ratio to a probability takes the form of a [[sigmoid curve]]: see [[logit]] for details.)

Finally, the document can be classified as follows. It is spam if <math>p(S\mid D) > p(\neg S\mid D)</math> (i.e., <math>\ln{p(S\mid D) \over p(\neg S\mid D)} > 0</math>); otherwise it is not spam.
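The decision rule can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (<code>spam_log_odds</code>, <code>probability_from_log_odds</code>, <code>is_spam</code>) and the word probabilities in the two tables are hypothetical values chosen for the example, and every word of the document is assumed to have a known, non-zero probability in both classes. How those probabilities are estimated from training data is outside the scope of this sketch.

```python
import math

# Hypothetical per-word probabilities p(w | S) and p(w | not S).
# These numbers are made up for illustration only.
P_WORD_SPAM = {"offer": 0.8, "meeting": 0.1}
P_WORD_HAM = {"offer": 0.2, "meeting": 0.4}


def spam_log_odds(words, p_spam, p_word_spam, p_word_ham):
    """Return ln(p(S|D) / p(not S|D)) for a document given as a list of words.

    Assumes 0 < p_spam < 1 and that every word appears in both tables
    with a non-zero probability.
    """
    # Prior term: ln(p(S) / p(not S)).
    log_odds = math.log(p_spam / (1.0 - p_spam))
    # One log-likelihood ratio per word: ln(p(w_i | S) / p(w_i | not S)).
    for w in words:
        log_odds += math.log(p_word_spam[w] / p_word_ham[w])
    return log_odds


def probability_from_log_odds(log_odds):
    """Convert the log ratio to p(S|D), using p(S|D) + p(not S|D) = 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def is_spam(words, p_spam, p_word_spam, p_word_ham):
    """Classify as spam when the log ratio is positive."""
    return spam_log_odds(words, p_spam, p_word_spam, p_word_ham) > 0


if __name__ == "__main__":
    doc = ["offer"]
    odds = spam_log_odds(doc, 0.5, P_WORD_SPAM, P_WORD_HAM)
    print(odds, probability_from_log_odds(odds), is_spam(doc, 0.5, P_WORD_SPAM, P_WORD_HAM))
```

Summing logarithms instead of multiplying the raw ratios avoids numerical underflow when a document contains many words, which is one practical reason for working with the log form of the ratio.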