== Operation ==
While other statistical filters perform [[Bayesian spam filtering]] based on the frequency of single-word occurrences in email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases of up to five words in length. These phrases are used to form a [[Markov Random Field]] representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis<ref>{{Cite web |last=Garretson |first=Cara |date=2007-03-19 |title=The antispam man |url=https://www.networkworld.com/article/2296297/the-antispam-man.html |access-date= |website=Network World |language=en}}</ref> gave an accuracy of 99.87%;<ref>{{Cite web |date=2002-10-16 |title=CRM114 gets 99.87% |url=http://www.paulgraham.com/wsy.html |access-date= |website=[[Paul Graham (programmer)|Paul Graham]]'s website}}</ref> tests by Holden<ref name=holden>[https://web.archive.org/web/20050307062526/http://sam.holden.id.au/writings/spam2/ ''Spam Filtering II'']</ref> and at [[Text Retrieval Conference|TREC 2005 and 2006]]<ref name="trec14">[http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf ''Spam Track Overview'' (2005)] - [[Text Retrieval Conference|TREC 2005]]</ref><ref name="trec15">[http://trec.nist.gov/pubs/trec15/papers/SPAM06.OVERVIEW.pdf ''Spam Track Overview'' (2006)] - [[Text Retrieval Conference|TREC 2006]]</ref> gave results of better than 99%, with significant variation depending on the particular corpus.
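The phrase-based feature idea described above can be illustrated with a minimal Python sketch. It is not taken from the CRM114 source: the function name <code>sparse_phrases</code>, the <code>&lt;skip&gt;</code> marker, and the choice to always keep the newest word in the window are assumptions made for illustration only. For each word, it emits every phrase of up to five words ending at that word, with any of the earlier words optionally replaced by a skip marker.

```python
from itertools import product

SKIP = "<skip>"  # hypothetical placeholder for a skipped word (an assumption)


def sparse_phrases(tokens, window=5):
    """Return phrase features of up to `window` words ending at each token.

    Illustrative only: the newest word is always kept, and each of the
    preceding words in the window is either kept or replaced by SKIP.
    """
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        context = tokens[start:end]  # at most window - 1 preceding words
        # Enumerate every keep/skip combination over the preceding words.
        for mask in product((True, False), repeat=len(context)):
            phrase = [word if keep else SKIP for word, keep in zip(context, mask)]
            phrase.append(tokens[end])
            features.append(" ".join(phrase))
    return features


if __name__ == "__main__":
    for feature in sparse_phrases("click here to unsubscribe now".split()):
        print(feature)
```

A classifier built on such features would count how often each phrase appears in each class of training text, rather than counting single words.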
CRM114's [[Statistical classification|classifier]] can also be switched to use Littlestone's [[Winnow (algorithm)|Winnow]] algorithm, character-by-character [[correlation]], a variant of KNN ([[K-nearest neighbor algorithm]]) classification called Hyperspace, a bit-entropic classifier that uses [[entropy encoding]] to determine similarity, a [[support vector machine|SVM]], a classifier based on mutual compressibility as calculated by a modified [[Lempel-Ziv|LZ77]] algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of [[n-gram|skip-grams]].

The CRM114 algorithms are multi-lingual (compatible with [[UTF-8]] encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to detect confidential versus non-confidential documents written in [[Japanese language|Japanese]] with a detection rate better than 99.9% and a false alarm rate of 5.3%.<ref>{{cite web |url=https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf |title=Archived copy |website=media.blackhat.com |archive-url=https://web.archive.org/web/20110708011918/https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf |archive-date=2011-07-08 |url-status=}}</ref>

CRM114 is a good example of [[pattern recognition]] software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the [[GPL]].

At a deeper level, CRM114 is also a string pattern matching language, similar to [[grep]] or even [[Perl]]. Although it is [[Turing complete]], it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of this is because the crm114 language syntax is not [[positional]], but [[declension]]al. As a programming language, it may be used for many other applications aside from detecting spam.
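The idea of classifying by mutual compressibility, listed among the alternative classifiers above, can be sketched in Python. This sketch substitutes Python's standard <code>zlib</code> (DEFLATE) compressor and the normalized compression distance for CRM114's modified LZ77 algorithm, so it shows the general technique rather than CRM114's implementation. The names <code>ncd</code> and <code>classify</code> and the sample corpora are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller means x and y share more structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text: bytes, corpora: dict) -> str:
    """Return the label of the corpus that compresses best together with `text`."""
    return min(corpora, key=lambda label: ncd(text, corpora[label]))


if __name__ == "__main__":
    # Hypothetical toy corpora; real training sets would be far larger.
    corpora = {
        "spam": b"buy cheap pills online now, limited offer, click here " * 20,
        "ham": b"the project review meeting is moved to monday morning " * 20,
    }
    print(classify(b"cheap pills, click here for the offer", corpora))
```

Text that shares many substrings with a corpus adds little to the compressed size of the combined data, which is why the smallest distance is taken as the best match.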
CRM114 uses the [[TRE (computing)|TRE]] approximate-match [[regex]] engine, so it is possible to write programs that function correctly without requiring strings to match exactly. CRM114 has been applied to [[email filtering]] in the KMail client<ref>{{cite web|url=http://www.nnc3.com/mags/LM10/Magazine/Archive/2007/77/074-077_kmail/article.html|title=Removing spam mail with CRM114 and KMail|archive-url=https://web.archive.org/web/20191001092857/http://www.nnc3.com/mags/LM10/Magazine/Archive/2007/77/074-077_kmail/article.html|archive-date=2019-10-01|url-status=live|access-date=2019-10-01}}</ref><ref>{{cite web|url=https://github.com/KDE/kdepim-addons/blob/master//kmail/plugins/common/kmail.antispamrc#L223|title=kmail.antispamrc at KDE/kdepim-addons|website=[[GitHub]]|date=12 June 2022 }}</ref> and a number of other applications, including detection of bots on Twitter and Yahoo,<ref>{{Cite journal |last1=Chu |first1=Zi |last2=Gianvecchio |first2=Steven |last3=Wang |first3=Haining |last4=Jajodia |first4=Sushil |date=November 2012 |title=Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg? |url=https://ieeexplore.ieee.org/document/6280553 |journal=IEEE Transactions on Dependable and Secure Computing |volume=9 |issue=6 |pages=811–824 |doi=10.1109/TDSC.2012.75 |s2cid=351844 |issn=1545-5971|url-access=subscription }}</ref><ref>{{Cite web |title=Measurement and Classification of Humans and Bots in Internet Chat |url=https://www.usenix.org/legacy/events/sec08/tech/full_papers/gianvecchio/gianvecchio_html/index.html |access-date=2023-01-16 |website=Usenix}}</ref> as well as the first-level filter in the US Department of Transportation's vehicle defect detection system.<ref>{{Cite report |url=https://www.oig.dot.gov/sites/default/files/NHTSA%20Safety-Related%20Vehicle%20Defects%20-%20Final%20Report%5E6-18-15.pdf |title=Inadequate Data and Analysis Undermine NHTSA's Efforts To Identify and Investigate Vehicle Safety Concerns |last=Scovel III |first=Calvin L. |date=2015-06-18 |publisher=Office of Inspector General - U.S. Department of Transportation}}</ref> It has also been used as a predictive method for classifying fault-prone software modules.<ref>{{Cite book |last1=Mizuno |first1=Osamu |last2=Ikami |first2=Shiro |last3=Nakaichi |first3=Shuya |last4=Kikuno |first4=Tohru |title=Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007) |chapter=Spam Filter Based Approach for Finding Fault-Prone Software Modules |date=May 2007 |chapter-url=https://ieeexplore.ieee.org/document/4228641 |pages=4 |doi=10.1109/MSR.2007.29|isbn=978-0-7695-2950-9 |s2cid=5867386 }}</ref>
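The kind of approximate matching that an engine such as TRE provides can be illustrated with a short Python sketch. It does not use TRE or its pattern syntax. Instead it implements a standard edit-distance search for a literal pattern, where a match may start anywhere in the text. The function name <code>approx_search</code> and its parameters are hypothetical.

```python
def approx_search(pattern: str, text: str, max_errors: int) -> bool:
    """Report whether some substring of `text` is within `max_errors`
    insertions, deletions, or substitutions of `pattern`.

    Uses the edit-distance recurrence with a zero-cost first row, so a
    match is allowed to begin at any position in the text.
    """
    m = len(pattern)
    # col[i] = fewest edits needed to match pattern[:i] against a
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    if col[m] <= max_errors:
        return True
    for ch in text:
        new = [0]  # the empty pattern prefix matches anywhere at no cost
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(col[i - 1] + cost,  # match or substitution
                           col[i] + 1,         # extra character in the text
                           new[i - 1] + 1))    # pattern character missing
        col = new
        if col[m] <= max_errors:
            return True
    return False


if __name__ == "__main__":
    print(approx_search("viagra", "buy v1agra now", max_errors=1))
```

Allowing a small number of errors lets a single pattern catch deliberately misspelled variants that an exact match would miss.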