==Techniques==

===Pre-processing===
OCR software often pre-processes images to improve the chances of successful recognition. Techniques include:<ref name="nicomsoft">{{cite web|url=https://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ |title=Optical Character Recognition (OCR) – How it works |publisher=Nicomsoft.com |access-date=2013-06-16}}</ref>

* De-[[Skew (fax)|skewing]]{{spaced ndash}}if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise to make lines of text perfectly horizontal or vertical.
* [[Despeckle|Despeckling]]{{spaced ndash}}removal of positive and negative spots, and smoothing of edges.
* Binarization{{spaced ndash}}conversion of an image from color or [[greyscale]] to black-and-white (called a [[binary image]] because there are two colors), a simple way of separating the text (or any other desired image component) from the background (a minimal thresholding sketch appears at the end of this subsection).<ref name="Sezgin2004">{{cite journal|last1=Sezgin|first1=Mehmet|last2=Sankur|first2=Bulent|date=2004|title=Survey over image thresholding techniques and quantitative performance evaluation|url=http://webdocs.cs.ualberta.ca/~nray1/CMPUT605/track3_papers/Threshold_survey.pdf|journal=Journal of Electronic Imaging|volume=13|issue=1|page=146|bibcode=2004JEI....13..146S|doi=10.1117/1.1631315|archive-url=https://web.archive.org/web/20151016080410/http://webdocs.cs.ualberta.ca/~nray1/CMPUT605/track3_papers/Threshold_survey.pdf|archive-date=October 16, 2015|access-date=2 May 2015}}</ref> Binarization is necessary because most commercial recognition algorithms work only on binary images, which are simpler to process.<ref name="Gupta2007">{{cite journal|last1=Gupta|first1=Maya R.|last2=Jacobson|first2=Nathaniel P.|last3=Garcia|first3=Eric K.|date=2007|title=OCR binarisation and image pre-processing for searching historical documents.|url=http://www.rfai.li.univ-tours.fr/fr/ressources/_dh/DOC/DocOCR/OCRbinarisation.pdf|journal=Pattern Recognition|volume=40|issue=2|page=389|doi=10.1016/j.patcog.2006.04.043|bibcode=2007PatRe..40..389G|archive-url=https://web.archive.org/web/20151016080410/http://www.rfai.li.univ-tours.fr/fr/ressources/_dh/DOC/DocOCR/OCRbinarisation.pdf|archive-date=October 16, 2015|access-date=2 May 2015}}</ref> The effectiveness of binarization strongly influences the quality of character recognition, so the binarization method must be chosen carefully for the given input image type, since the best method depends on whether the input is a scanned document, a [[scene text]] image, a degraded historical document, etc.<ref name=Trier1995>{{cite journal|last1=Trier|first1=Oeivind Due|last2=Jain|first2=Anil K.|title=Goal-directed evaluation of binarisation methods.|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|date=1995|volume=17|issue=12|pages=1191–1201|url=http://heim.ifi.uio.no/inf386/trier2.pdf |archive-url=https://web.archive.org/web/20151016080411/http://heim.ifi.uio.no/inf386/trier2.pdf |archive-date=2015-10-16 |url-status=live|access-date=2 May 2015|doi=10.1109/34.476511}}</ref><ref name="Milyaev2013">{{cite book|last1=Milyaev|first1=Sergey|last2=Barinova|first2=Olga|last3=Novikova|first3=Tatiana|last4=Kohli|first4=Pushmeet|last5=Lempitsky|first5=Victor|title=2013 12th International Conference on Document Analysis and Recognition |chapter=Image Binarization for End-to-End Text Understanding in Natural Images |date=2013|url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/mbnlk_icdar2013.pdf |archive-url=https://web.archive.org/web/20171113184347/https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/mbnlk_icdar2013.pdf |archive-date=2017-11-13 |url-status=live |pages=128–132|doi=10.1109/ICDAR.2013.33|isbn=978-0-7695-4999-6|s2cid=8947361|access-date=2 May 2015}}</ref>
* Line removal{{spaced ndash}}cleaning up non-glyph boxes and lines.
* [[Document Layout Analysis|Layout analysis]] or zoning{{spaced ndash}}identification of columns, paragraphs, captions, etc. as distinct blocks; especially important in [[Column (typography)|multi-column layouts]] and [[Table (information)|tables]].
* Line and word detection{{spaced ndash}}establishment of a baseline for word and character shapes, separating words as necessary.
* Script recognition{{spaced ndash}}in multilingual documents the script may change at the level of individual words, so the script must be identified before the right OCR engine can be invoked to handle it.<ref>{{Cite journal |last1=Pati |first1=P.B. |last2=Ramakrishnan |first2=A.G. |title=Word Level Multi-script Identification |date=2008 |journal=Pattern Recognition Letters |volume=29 |issue=9 |pages=1218–1229 |doi=10.1016/j.patrec.2008.01.027|bibcode=2008PaReL..29.1218P }}</ref>
* Character isolation or segmentation{{spaced ndash}}for per-character OCR, multiple characters that are connected due to image artifacts must be separated, and single characters that are broken into multiple pieces by artifacts must be joined.
* Normalization of [[aspect ratio]] and [[Scale (ratio)|scale]].<ref>{{cite web|url=http://blog.damiles.com/2008/11/20/basic-ocr-in-opencv.html |title=Basic OCR in OpenCV | Damiles |publisher=Blog.damiles.com |access-date=2013-06-16|date=2008-11-20 }}</ref>

Segmentation of [[fixed-pitch font]]s is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For [[proportional font]]s, more sophisticated techniques are needed, because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.<ref name="Tesseract overview" />
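To make the binarization step concrete, the following is a minimal sketch in Python with NumPy of a classical global-threshold method (Otsu's method, one of the techniques surveyed by Sezgin and Sankur). It is illustrative only: real engines choose among many global and adaptive methods, and the tiny input image here is a placeholder.

<syntaxhighlight lang="python">
import numpy as np

def otsu_threshold(gray):
    """Pick the global threshold that maximizes between-class
    variance (Otsu's method) for an 8-bit greyscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = hist[:t].sum() / total   # weight of the "dark" class
        w1 = 1.0 - w0                 # weight of the "light" class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * hist[:t]).sum() / hist[:t].sum()
        mu1 = (np.arange(t, 256) * hist[t:]).sum() / hist[t:].sum()
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def binarize(gray):
    """Return a binary image: True where the pixel is darker than
    the Otsu threshold (likely ink), False for background."""
    return gray < otsu_threshold(gray)

# Placeholder input: a light page with a darker band of "text".
page = np.full((32, 32), 220, dtype=np.uint8)
page[12:20, 4:28] = 40
print(binarize(page).sum())  # number of pixels classified as ink
</syntaxhighlight>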
===Text recognition===
There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.<ref>{{cite web|url=http://www.dataid.com/aboutocr.htm |title=OCR Introduction |publisher=Dataid.com |access-date=2013-06-16}}</ref>

* ''Matrix matching'' involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as ''pattern matching'', ''[[pattern recognition]]'', or ''[[digital image correlation|image correlation]]''. It relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and does not work well when new fonts are encountered. It is, quite directly, the technique that early photocell-based OCR hardware implemented (a minimal sketch appears after this list).
* ''Feature extraction'' decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. The features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes.
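As an illustration of matrix matching, here is a minimal sketch in Python with NumPy: each stored template is compared with the input glyph pixel by pixel, and candidates are ranked by the fraction of agreeing pixels. The 8×8 template set is invented for illustration; real systems store one or more prototypes per character at a fixed scale.

<syntaxhighlight lang="python">
import numpy as np

def matrix_match(glyph, templates):
    """Rank candidate characters by pixel-level similarity.

    glyph     -- 2-D binary array (isolated, scale-normalized character)
    templates -- dict mapping a character to a stored binary array
                 of the same shape (a hypothetical template set)
    Returns a list of (character, score) pairs, best match first.
    """
    scores = []
    for char, template in templates.items():
        # Fraction of pixels on which input and template agree.
        scores.append((char, np.mean(glyph == template)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Invented 8x8 templates for two characters.
templates = {
    "I": np.zeros((8, 8), dtype=bool),
    "L": np.zeros((8, 8), dtype=bool),
}
templates["I"][:, 3:5] = True     # vertical bar
templates["L"][:, 1:3] = True     # vertical stroke...
templates["L"][6:8, 1:7] = True   # ...plus a foot

ranked = matrix_match(templates["L"], templates)
print(ranked[0][0])  # "L" - an exact template match scores 1.0
</syntaxhighlight>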
General techniques of [[Feature detection (computer vision)|feature detection in computer vision]] are applicable to this type of OCR, which is commonly seen in "intelligent" [[handwriting recognition]] and most modern OCR software.<ref name="ocrwizard">{{cite web|title=How OCR Software Works|url=http://ocrwizard.com/ocr-software/how-ocr-software-works.html|url-status=dead|archive-url=https://web.archive.org/web/20090816210246/http://ocrwizard.com/ocr-software/how-ocr-software-works.html|archive-date=August 16, 2009|access-date=2013-06-16|publisher=OCRWizard}}</ref> [[Nearest neighbour classifiers]] such as the [[k-nearest neighbors algorithm]] are used to compare image features with stored glyph features and choose the nearest match.<ref>{{cite web|url=http://blog.damiles.com/2008/11/14/the-basic-patter-recognition-and-classification-with-opencv.html |title=The basic pattern recognition and classification with openCV | Damiles |publisher=Blog.damiles.com |access-date=2013-06-16|date=2008-11-14 }}</ref>

Software such as [[CuneiForm (software)|Cuneiform]] and [[Tesseract (software)|Tesseract]] use a two-pass approach to character recognition. The second pass, known as adaptive recognition, uses the letter shapes recognized with high confidence on the first pass to improve recognition of the remaining letters. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).<ref name="Tesseract overview">{{cite web|author=Smith, Ray |year=2007|title=An Overview of the Tesseract OCR Engine|url=http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|url-status=dead|archive-url=https://web.archive.org/web/20100928052954/http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|archive-date=September 28, 2010|access-date=2013-05-23}}</ref>

{{As of|2016|12}}, modern OCR software includes [[Google Docs]] OCR, [[ABBYY FineReader]], and Transym.<ref>{{Cite journal|last=Assefi|first=Mehdi|date=December 2016|title=OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym|url=https://www.researchgate.net/publication/310645810|journal=ResearchGate}}</ref>{{update inline|date=June 2023}} Others, such as [[OCRopus]] and Tesseract, use [[Artificial neural network|neural networks]] trained to recognize whole lines of text rather than single characters.

A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR is then performed on each section individually, using variable character-confidence thresholds to maximize page-level accuracy. A United States patent has been issued for this method.<ref>{{Cite web|title=How the Best OCR Technology Captures 99.91% of Data|url=https://www.bisok.com/grooper-data-capture-method-features/multi-pass-ocr/|access-date=2021-05-27|website=www.bisok.com}}</ref>

The OCR result can be stored in the standardized [[ALTO (XML)|ALTO]] format, a dedicated [[XML schema]] maintained by the United States [[Library of Congress]]. Other common formats include [[hOCR]] and [[Page Analysis and Ground Truth Elements|PAGE XML]].

For a list of optical character recognition software, see [[Comparison of optical character recognition software]].
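To make the nearest-neighbour matching mentioned above concrete, here is a minimal sketch in Python with NumPy of the simplest (k = 1) case of the k-nearest neighbors algorithm. The feature scheme (counts of closed loops and free line ends) and the stored values are invented for illustration; real feature sets are richer and engine-specific.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical stored feature vectors for known glyphs, encoded as
# (number of closed loops, number of free line ends).
stored_features = np.array([
    [2.0, 0.0],   # "8": two loops, no free line ends
    [1.0, 0.0],   # "O": one loop
    [0.0, 2.0],   # "I": a single stroke with two ends
    [0.0, 4.0],   # "X": two crossing strokes with four ends
])
stored_labels = ["8", "O", "I", "X"]

def classify(features):
    """Return the label of the stored glyph whose feature vector is
    nearest in Euclidean distance (k-nearest neighbors, k = 1)."""
    distances = np.linalg.norm(stored_features - features, axis=1)
    return stored_labels[int(np.argmin(distances))]

# A glyph whose extracted features lie closest to "O".
print(classify(np.array([1.0, 0.5])))  # -> "O"
</syntaxhighlight>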
===Post-processing===
OCR accuracy can be increased if the output is constrained by a [[lexicon]]{{spaced ndash}}a list of words that are allowed to occur in a document.<ref name="nicomsoft" /> This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. The technique can be problematic if the document contains words not in the lexicon, such as [[proper noun]]s. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.<ref name="Tesseract overview" />

The output stream may be a [[plain text]] stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated [[PDF]] that includes both the original image of the page and a searchable textual representation.

''Near-neighbor analysis'' can make use of [[co-occurrence]] frequencies to correct errors, by noting that certain words are often seen together.<ref name="explain">{{cite web|author-first=Chris|author-last=Woodford|author-link=Chris Woodford (author)|url=http://www.explainthatstuff.com/how-ocr-works.html |title=How does OCR document scanning work? |publisher=Explain that Stuff |date=2012-01-30 |access-date=2013-06-16}}</ref> For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, allowing greater accuracy.

The [[Levenshtein distance]] algorithm has also been used in OCR post-processing to further optimize results from an OCR API (a minimal sketch appears at the end of this section).<ref>{{cite web|title=How to optimize results from the OCR API when extracting text from an image? - Haven OnDemand Developer Community|url=https://community.havenondemand.com/t5/Wiki/How-to-optimize-results-from-the-OCR-API-when-extracting-text/ta-p/1656|url-status=dead|archive-url=https://web.archive.org/web/20160322103356/https://community.havenondemand.com/t5/Wiki/How-to-optimize-results-from-the-OCR-API-when-extracting-text/ta-p/1656|archive-date=March 22, 2016}}</ref>

===Application-specific optimizations===
In recent years,{{when|date=March 2013}} the major OCR technology providers began to tune OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance can be achieved by taking into account business rules, regular expressions, or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of [[license plate]]s, [[invoice]]s, [[screenshot]]s, [[ID card]]s, [[driver's license]]s, and [[automobile manufacturing]].

''[[The New York Times]]'' has adapted OCR technology into a proprietary tool it calls ''Document Helper'', which enables its interactive news team to accelerate the processing of documents that need to be reviewed. They note that it enables them to process as many as 5,400 pages per hour in preparation for reporters to review the contents.<ref>{{Cite news |last=Fehr |first=Tiff |date=2019-03-26 |title=How We Sped Through 900 Pages of Cohen Documents in Under 10 Minutes |language=en-US |work=The New York Times |url=https://www.nytimes.com/2019/03/26/reader-center/times-documents-reporters-cohen.html |access-date=2023-06-16 |issn=0362-4331}}</ref>
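Returning to the Levenshtein-based post-processing mentioned above, here is a minimal sketch in Python, assuming a small invented lexicon: it computes edit distance by dynamic programming and snaps each OCR token to the closest lexicon word within a fixed edit budget. This illustrates the general idea only, not any particular OCR API's correction pipeline.

<syntaxhighlight lang="python">
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(token, lexicon, max_distance=2):
    """Replace an OCR token with the nearest lexicon word if one lies
    within max_distance edits; otherwise keep the token unchanged."""
    best = min(lexicon, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_distance else token

# Invented lexicon and a typical OCR confusion ("g" misread as "q").
lexicon = ["recognition", "character", "optical"]
print(correct("recoqnition", lexicon))  # -> "recognition"
</syntaxhighlight>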