===Post-processing===
OCR accuracy can be increased if the output is constrained by a [[lexicon]]{{spaced ndash}}a list of words that are allowed to occur in a document.<ref name="nicomsoft" /> This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words that are not in the lexicon, such as [[proper noun]]s. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.<ref name="Tesseract overview" />
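
As a rough illustration of lexicon constraint, the following Python sketch assumes that a recognizer reports several candidate characters, each with a confidence value, for every position in a word. The function name <code>constrain_to_lexicon</code>, the data layout, and the scoring by product of confidences are assumptions made for this example and are not taken from any particular OCR engine.

```python
from itertools import product


def constrain_to_lexicon(char_candidates, lexicon):
    """Return the best-scoring candidate spelling that appears in the lexicon.

    char_candidates: one list per character position, each holding
    (character, confidence) pairs (a hypothetical recognizer output).
    Returns None when no combination is in the lexicon, so that the
    caller can fall back to the raw output (e.g. for proper nouns).
    """
    best_word, best_score = None, 0.0
    # Exhaustive enumeration: acceptable for short words, exponential in general.
    for combo in product(*char_candidates):
        word = "".join(ch for ch, _ in combo)
        if word not in lexicon:
            continue
        score = 1.0
        for _, confidence in combo:
            score *= confidence  # assumed scoring rule: product of confidences
        if score > best_score:
            best_word, best_score = word, score
    return best_word


lexicon = {"clear", "cleat", "dear"}
candidates = [
    [("c", 0.9)],
    [("1", 0.6), ("l", 0.4)],  # the digit "1" outscores the letter "l"
    [("e", 0.9), ("c", 0.1)],
    [("a", 0.8), ("o", 0.2)],
    [("r", 0.9)],
]
print(constrain_to_lexicon(candidates, lexicon))  # clear
```

Taken character by character, the most confident reading is "c1ear"; restricting the result to the lexicon selects "clear" instead. Returning <code>None</code> for unmatched words reflects the caveat above about words that are missing from the lexicon.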

The output stream may be a [[plain text]] stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated [[PDF]] that includes both the original image of the page and a searchable textual representation.

''Near-neighbor analysis'' can make use of [[co-occurrence]] frequencies to correct errors, by noting that certain words are often seen together.<ref name="explain">{{cite web|author-first=Chris|author-last=Woodford|author-link=Chris Woodford (author)|url=http://www.explainthatstuff.com/how-ocr-works.html |title=How does OCR document scanning work? |publisher=Explain that Stuff |date=2012-01-30 |access-date=2013-06-16}}</ref> For example, "Washington, D.C." is generally far more common in English than "Washington DOC".
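
A minimal sketch of this idea, assuming that co-occurrence is approximated by counts of adjacent word pairs in a small reference corpus; the function names and the toy corpus are illustrative only and do not come from the cited source.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count how often each pair of adjacent tokens occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_by_context(previous_word, candidates, counts):
    """Choose the candidate seen most often directly after previous_word."""
    return max(candidates, key=lambda word: counts[(previous_word, word)])


# Toy reference corpus (an assumption for this example).
corpus = [
    "she moved to Washington, D.C. last year",
    "the museum is in Washington, D.C. near the river",
    "open the DOC file",
]
counts = bigram_counts(corpus)
print(pick_by_context("Washington,", ["DOC", "D.C."], counts))  # D.C.
```

A practical system would draw its counts from a much larger corpus and would usually combine them with the recognizer's own confidence values rather than relying on context alone.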

Knowledge of the grammar of the language being scanned can also help determine whether a word is more likely to be, for example, a verb or a noun, allowing greater accuracy.

The [[Levenshtein distance]] algorithm has also been used in OCR post-processing to further improve the results returned by an OCR API.<ref>{{cite web|title=How to optimize results from the OCR API when extracting text from an image? - Haven OnDemand Developer Community|url=https://community.havenondemand.com/t5/Wiki/How-to-optimize-results-from-the-OCR-API-when-extracting-text/ta-p/1656|url-status=dead|archive-url=https://web.archive.org/web/20160322103356/https://community.havenondemand.com/t5/Wiki/How-to-optimize-results-from-the-OCR-API-when-extracting-text/ta-p/1656|archive-date=March 22, 2016}}</ref>
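
One possible way to apply the distance, sketched below in Python, is to replace each recognized token with the closest word in a lexicon when the two differ by only a few edits. The helper <code>closest_word</code> and its <code>max_distance</code> threshold are assumptions made for this illustration and are not the method described in the cited source.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_word(token, lexicon, max_distance=2):
    """Return the nearest lexicon word, or the token itself if none is
    within max_distance edits (a hypothetical threshold)."""
    best = min(lexicon, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


words = ["recognition", "recreation", "character"]
print(closest_word("recogniti0n", words))  # recognition
print(closest_word("Xylophonist", words))  # Xylophonist (left unchanged)
```

The threshold keeps tokens that are far from every lexicon entry, such as unfamiliar names, from being forced onto an unrelated word.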