
== Operation ==
While other filters perform statistical [[Bayesian spam filtering]] based on the frequency of single words in email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases up to five words in length. These phrases are used to form a [[Markov Random Field]] representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis<ref>{{Cite web |last=Garretson |first=Cara |date=2007-03-19 |title=The antispam man |url=https://www.networkworld.com/article/2296297/the-antispam-man.html |access-date= |website=Network World |language=en}}</ref> gave a 99.87% accuracy;<ref>{{Cite web |date=2002-10-16 |title=CRM114 gets 99.87% |url=http://www.paulgraham.com/wsy.html |access-date= |website=[[Paul Graham (programmer)|Paul Graham]]'s website}}</ref> Holden<ref name=holden>[https://web.archive.org/web/20050307062526/http://sam.holden.id.au/writings/spam2/ ''Spam Filtering II'']</ref> and [[Text Retrieval Conference|TREC 2005 and 2006]]<ref name="trec14">[http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf ''Spam Track Overview'' (2005)] - [[Text Retrieval Conference|TREC 2005]]</ref><ref name="trec15">[http://trec.nist.gov/pubs/trec15/papers/SPAM06.OVERVIEW.pdf ''Spam Track Overview'' (2006)] - [[Text Retrieval Conference|TREC 2006]]</ref> gave results of better than 99%, with significant variation depending on the particular corpus.
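
To illustrate phrase-based features, the following Python sketch slides a five-word window over a text and pairs each word with each of the words that follow it inside the window, marking any skipped positions. The function name <code>window_features</code>, the <code>&lt;skip&gt;</code> placeholder and the whitespace tokenizer are assumptions made for this example; it is not CRM114's actual feature generator or hashing scheme.

```python
def window_features(text, window=5):
    """Return sparse word-pair features found inside a sliding window.

    Illustrative only: tokenization and feature layout are assumptions,
    not CRM114's real implementation.
    """
    tokens = text.split()
    features = []
    for i, first in enumerate(tokens):
        # Pair the current word with each of the next (window - 1) words.
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            # Mark skipped positions so that "free <skip> offer"
            # stays distinct from the adjacent pair "free offer".
            skipped = ["<skip>"] * (gap - 1)
            features.append(" ".join([first] + skipped + [tokens[j]]))
    return features


if __name__ == "__main__":
    for feature in window_features("click here for free offer"):
        print(feature)
```

Each feature produced this way could then be counted per class in the same manner as single words, so that a classifier sees word order and proximity rather than isolated words.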

CRM114's [[Statistical classification|classifier]] can also be switched to use Littlestone's [[Winnow (algorithm)|Winnow]] algorithm, character-by-character [[correlation]], a variant on KNN ([[K-nearest neighbor algorithm]]) classification called Hyperspace, a bit-entropic classifier that uses [[entropy encoding]] to determine similarity, an [[support vector machine|SVM]], a classifier based on mutual compressibility as calculated by a modified [[Lempel-Ziv|LZ77]] algorithm, and other more experimental classifiers.  The actual features matched are based on a generalization of [[n-gram|skip-grams]].
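
As an illustration of one of these alternatives, the following Python sketch implements the textbook form of Winnow with multiplicative promotion and demotion (often called Winnow2). It is a generic version of the algorithm, not CRM114's implementation; the function names, the feature indices and the threshold of half the feature count are assumptions chosen for the example.

```python
def predict(weights, threshold, active):
    """Return 1 if the summed weights of the active features exceed the threshold."""
    return 1 if sum(weights[i] for i in active) > threshold else 0


def train_winnow(examples, n_features, alpha=2.0, epochs=10):
    """Train Winnow on (active_feature_indices, label) pairs, label in {0, 1}."""
    weights = [1.0] * n_features      # every feature starts with weight 1
    threshold = n_features / 2        # assumed threshold for this sketch
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            if predict(weights, threshold, active) == label:
                continue              # weights change only on mistakes
            mistakes += 1
            # Promote active features after a missed positive,
            # demote them after a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break                     # a full pass without errors
    return weights, threshold


if __name__ == "__main__":
    # Hypothetical feature indices: 0 and 2 appear in spam, 1 and 3 in ham.
    examples = [([0, 2], 1), ([1, 3], 0), ([0], 1), ([1], 0)]
    weights, threshold = train_winnow(examples, n_features=4)
    print(weights, predict(weights, threshold, [0, 3]))
```

Because updates are multiplicative and happen only on mistakes, weights of features that reliably indicate one class grow quickly while irrelevant features stay near their initial value.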

The CRM114 algorithms are multi-lingual (compatible with [[UTF-8]] encodings) and null-safe.  A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in [[Japanese language|Japanese]] with a detection rate better than 99.9% and a false alarm rate of 5.3%.<ref>{{cite web |url=https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf |title=Archived copy |website=media.blackhat.com |archive-url=https://web.archive.org/web/20110708011918/https://media.blackhat.com/bh-us-10/whitepapers/Yerazunis/BlackHat-USA-2010-Yerazunis-Confidential-Mail-Filtering-wp.pdf |archive-date=2011-07-08 |url-status=}}</ref>

CRM114 is a good example of [[pattern recognition]] software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the [[GPL]].

At a deeper level, CRM114 is also a string pattern matching language, similar to [[grep]] or even [[Perl]]. Although it is [[Turing complete]], it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines.  Part of this is because the CRM114 language syntax is not [[positional]], but [[declension]]al.  As a programming language, it may be used for many applications other than detecting spam.  CRM114 uses the [[TRE (computing)|TRE]] approximate-match [[regex]] engine, so it is possible to write programs that do not depend on absolutely identical strings matching to function correctly.
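
The idea of approximate matching can be illustrated without TRE itself. The following Python sketch does not use TRE or CRM114 syntax; it applies a standard dynamic-programming search (Sellers' algorithm) to find the smallest number of single-character edits needed to match a literal pattern against any substring of a text. The function names and the <code>max_errors</code> parameter are assumptions made for this example.

```python
def best_approx_match(pattern, text):
    """Return the smallest edit distance between pattern and any substring of text."""
    # previous[i] = edits needed to match pattern[:i] against some
    # substring of the text ending just before the current character.
    previous = list(range(len(pattern) + 1))
    best = previous[-1]
    for ch in text:
        current = [0]  # an empty pattern prefix matches anywhere at no cost
        for i, p in enumerate(pattern, start=1):
            current.append(min(
                previous[i - 1] + (p != ch),  # match or substitution
                previous[i] + 1,              # extra character in the text
                current[i - 1] + 1,           # pattern character missing from the text
            ))
        best = min(best, current[-1])
        previous = current
    return best


def approx_search(pattern, text, max_errors=1):
    """Return True if pattern occurs in text with at most max_errors edits."""
    return best_approx_match(pattern, text) <= max_errors


if __name__ == "__main__":
    print(approx_search("viagra", "buy v1agra now"))   # one substitution away
    print(approx_search("viagra", "meeting at noon"))  # no close match
```

A filter built on this kind of matching can still recognize a word after a deliberate misspelling, which exact string comparison would miss.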

CRM114 has been applied to [[email filtering]] in the KMail client<ref>{{cite web|url=http://www.nnc3.com/mags/LM10/Magazine/Archive/2007/77/074-077_kmail/article.html|title=Removing spam mail with CRM114 and KMail|archive-url=https://web.archive.org/web/20191001092857/http://www.nnc3.com/mags/LM10/Magazine/Archive/2007/77/074-077_kmail/article.html|archive-date=2019-10-01|url-status=live|access-date=2019-10-01}}</ref><ref>{{cite web|url=https://github.com/KDE/kdepim-addons/blob/master//kmail/plugins/common/kmail.antispamrc#L223|title=kmail.antispamrc at KDE/kdepim-addons|website=[[GitHub]]|date=12 June 2022 }}</ref> and a number of other applications, including detection of bots on Twitter and Yahoo,<ref>{{Cite journal |last1=Chu |first1=Zi |last2=Gianvecchio |first2=Steven |last3=Wang |first3=Haining |last4=Jajodia |first4=Sushil |date=November 2012 |title=Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg? |url=https://ieeexplore.ieee.org/document/6280553 |journal=IEEE Transactions on Dependable and Secure Computing |volume=9 |issue=6 |pages=811–824 |doi=10.1109/TDSC.2012.75 |s2cid=351844 |issn=1545-5971|url-access=subscription }}</ref><ref>{{Cite web |title=Measurement and Classification of Humans and Bots in Internet Chat |url=https://www.usenix.org/legacy/events/sec08/tech/full_papers/gianvecchio/gianvecchio_html/index.html |access-date=2023-01-16 |website=Usenix}}</ref> as well as the first-level filter in the US Department of Transportation's vehicle defect detection system.<ref>{{Cite report |url=https://www.oig.dot.gov/sites/default/files/NHTSA%20Safety-Related%20Vehicle%20Defects%20-%20Final%20Report%5E6-18-15.pdf |title=Inadequate Data and Analysis Undermine NHTSA's Efforts To Identify and Investigate Vehicle Safety Concerns |last=Scovel III |first=Calvin L. |date=2015-06-18 |publisher=Office of Inspector General - U.S. Department of Transportation}}</ref>  It has also been used as a predictive method for classifying fault-prone software modules.<ref>{{Cite book |last1=Mizuno |first1=Osamu |last2=Ikami |first2=Shiro |last3=Nakaichi |first3=Shuya |last4=Kikuno |first4=Tohru |title=Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007) |chapter=Spam Filter Based Approach for Finding Fault-Prone Software Modules |date=May 2007 |chapter-url=https://ieeexplore.ieee.org/document/4228641 |pages=4 |doi=10.1109/MSR.2007.29|isbn=978-0-7695-2950-9 |s2cid=5867386 }}</ref>