==Accuracy==
{{update|date=March 2013}}
[[File:Google_Ngrams_(English_2009)_ocurrence_of_laft_and_last.png|thumb|Occurrence of ''laft'' and ''last'' in Google's [[n-gram]]s database, in English documents from 1700 to 1900, based on OCR scans for the "English 2009" corpus]]
[[File:Google_Ngrams_(English_2012)_ocurrence_of_laft_and_last.png|thumb|Occurrence of ''laft'' and ''last'' in Google's [[n-gram]]s database, based on OCR scans for the "English 2012" corpus<ref name=":0">{{Cite web |title=Google Books Ngram Viewer |url=https://books.google.com/ngrams/info |access-date=2023-07-20 |website=books.google.com |language=en |quote=When we generated the original Ngram Viewer corpora in 2009, our OCR wasn't as good […]. This was especially obvious in pre-19th century English, where the [[long S|elongated medial-s]] (ſ) was often interpreted as an f, […]. Here's evidence of the improvements we've made since then, using the corpus operator to compare the 2009, 2012 and 2019 versions […]}}</ref>]]
[[File:Google_Ngrams_(English_2019)_long_s_normalization.png|thumb|Searches for words containing a [[Long s|long s]] in the English 2012 corpus or later are normalized to an ordinary "s".]]
Commissioned by the [[U.S. Department of Energy]] (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the ''Annual Tests of OCR Accuracy'' from 1992 to 1996.<ref>{{cite web|url=https://code.google.com/p/isri-ocr-evaluation-tools/|title=Code and Data to evaluate OCR accuracy, originally from UNLV/ISRI|publisher=Google Code Archive}}</ref>

Recognition of typewritten, [[Latin script]] text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%;<ref>{{cite web |url=http://www.dlib.org/dlib/march09/holley/03holley.html |access-date=5 January 2014 |date=April 2009 |last=Holley |first=Rose |publisher=D-Lib Magazine |title=How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs}}</ref> total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas{{snd}}including recognition of hand printing, [[cursive]] handwriting, and printed text in other scripts (especially East Asian languages, whose characters often require many strokes){{snd}}are still the subject of active research. The [[MNIST database]] is commonly used for testing systems' ability to recognize handwritten digits.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% or worse if the measurement is based on whether each whole word was recognized with no incorrect letters.<ref>{{Cite conference |last1=Suen |first1=C.Y. |last2=Plamondon |first2=R. |last3=Tappert |first3=A. |last4=Thomassen |first4=A. |last5=Ward |first5=J.R. |last6=Yamamoto |first6=K. |title=Future Challenges in Handwriting and Computer Applications |date=1987-05-29 |conference=3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987 |url=http://users.erols.com/rwservices/pens/biblio88.html#Suen88 |access-date=2008-10-03}}</ref>
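
The arithmetic behind that figure can be illustrated with a simplified model (not part of the cited study): if character errors were independent and uniformly distributed, a word of ''n'' letters is misrecognized with probability 1 − (1 − ''c'')<sup>''n''</sup>, where ''c'' is the character error rate. The short Python sketch below makes the compounding concrete; the function name and word lengths are illustrative choices.
<syntaxhighlight lang="python">
# Simplified model (illustrative only): character errors are assumed to be
# independent, so a word is wrong whenever at least one of its characters
# is misrecognized.

def expected_word_error_rate(char_error_rate: float, word_length: int) -> float:
    """Probability that a word of the given length contains at least one error."""
    return 1.0 - (1.0 - char_error_rate) ** word_length

cer = 0.01  # 1% character error rate, i.e. 99% per-character accuracy
for length in (4, 5, 6, 8):
    print(f"{length}-letter words: ~{expected_word_error_rate(cer, length):.1%} word error rate")
# Five-letter words come out near 4.9%, consistent with the "5% or worse"
# figure when no lexicon is available to correct non-existent words.
</syntaxhighlight>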
Using a large enough dataset is important for neural-network-based handwriting recognition solutions. On the other hand, producing natural datasets is very complicated and time-consuming.<ref>{{cite book|title=Comparison of Synthesized and Natural Datasets in Neural Network Based Handwriting Solutions |first1=Maedeh Haji Agha |last1=Mohseni |first2=Reza |last2=Azmi |first3=Kamran |last3=Layeghi |first4=Sajad |last4=Maleki |date=2019 |publisher=ITCT |url=https://civilica.com/doc/924198/certificate/pdf/ |via=Civilica}}</ref>

An example of the difficulties inherent in digitizing old text is the inability of OCR to differentiate between the "[[long s]]" and "f" characters.<ref>{{cite book|title=Research and Advanced Technology for Digital Libraries|author=Kapidakis, Sarantos; Mazurek, Cezary and Werla, Marcin |date=2015|page=257|publisher=Springer|isbn=9783319245928|url=https://books.google.com/books?id=kEyGCgAAQBAJ&q=OCR+and+long+s}}</ref><ref name=":0" />

Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years{{when|date=March 2013}} (see [[tablet computer|Tablet PC history]]). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by [[pen computing]] software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.{{citation needed|date=May 2009}}

Recognition of [[cursive|cursive text]] is an active area of research, with recognition rates even lower than those for [[hand-printed text]]. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the ''Amount'' line of a [[cheque|check]] (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.{{citation needed|date=May 2009}}

Most programs allow users to set "confidence rates". This means that if the software does not achieve the desired level of accuracy, a user can be notified for manual review.

An error introduced by OCR scanning is sometimes termed a ''scanno'' (by analogy with the term [[Typographical error|''typo'']]).<ref>{{Cite journal|doi = 10.4155/ppa.15.21|title = Reinventing nonpatent literature for pharmaceutical patenting|year = 2015|last1 = Atkinson|first1 = Kristine H.|journal = Pharmaceutical Patent Analyst|volume = 4|issue = 5|pages = 371–375|pmid = 26389649}}</ref><ref>{{cite web|url=https://www.hoopoes.com/jargon/entry/scanno.shtml|title=scanno|website=Hoopoes|date=May 2001}}</ref>
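
The long-s problem described above can also be handled after recognition or at search time; the Ngram Viewer caption above describes exactly that, folding the long s into an ordinary "s" in the 2012 and later corpora. A minimal sketch of that kind of normalization follows; the function name is illustrative.
<syntaxhighlight lang="python">
# Minimal sketch of long-s normalization: the long s (U+017F) is folded to
# an ordinary "s" so that a query for "last" also matches text typeset or
# recognized as "laſt".
def normalize_long_s(text: str) -> str:
    return text.replace("\u017F", "s")

print(normalize_long_s("laſt"))      # -> "last"
print(normalize_long_s("Congreſs"))  # -> "Congress"
</syntaxhighlight>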
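
As a concrete illustration of the smaller-dictionary idea mentioned for the check's ''Amount'' line, the sketch below snaps noisy OCR tokens to the closest entry in a small lexicon of number words. The word list, matching threshold, and helper function are illustrative assumptions, not a description of any particular product.
<syntaxhighlight lang="python">
import difflib

# Hypothetical small lexicon: words that can appear on a check's written-out
# Amount line. A real system would also handle compounds such as "twenty-one"
# and the surrounding currency wording.
AMOUNT_LEXICON = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "twenty", "thirty", "forty", "fifty",
    "hundred", "thousand", "and", "dollars", "cents",
]

def correct_with_lexicon(ocr_word: str, lexicon=AMOUNT_LEXICON, cutoff=0.6) -> str:
    """Snap a noisy OCR token to the closest word in a small lexicon."""
    matches = difflib.get_close_matches(ocr_word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word

# "fifty" with its last letter misread, and "dollars" with "l" misread as "I":
print(correct_with_lexicon("fiftv"))    # -> "fifty"
print(correct_with_lexicon("doIlars"))  # -> "dollars"
</syntaxhighlight>
Because the candidate set is tiny, even badly damaged tokens usually have only one plausible match, which is why a restricted dictionary raises recognition rates so sharply.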
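
The confidence-rate workflow described above can be sketched with the open-source Tesseract engine through its pytesseract Python wrapper; the choice of engine, the 80% threshold, and the function below are illustrative assumptions, not part of any standard.
<syntaxhighlight lang="python">
# Sketch of flagging low-confidence words for manual review, assuming
# Tesseract via pytesseract (an illustrative choice of engine).
from PIL import Image
import pytesseract
from pytesseract import Output

def words_needing_review(image_path: str, min_confidence: float = 80.0):
    """Return (word, confidence) pairs that fall below the review threshold."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # reported per word; -1 marks non-text entries
        if text.strip() and 0 <= conf < min_confidence:
            flagged.append((text, conf))
    return flagged

# Example usage (path is hypothetical):
# for word, conf in words_needing_review("scanned_page.png"):
#     print(f"review: {word!r} (confidence {conf:.0f})")
</syntaxhighlight>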