Optical character recognition
===Text recognition===
There are two basic types of core OCR algorithm, either of which may produce a ranked list of candidate characters.<ref>{{cite web|url=http://www.dataid.com/aboutocr.htm |title=OCR Introduction |publisher=Dataid.com |access-date=2013-06-16}}</ref>
* ''Matrix matching'' compares an image to a stored glyph pixel by pixel; it is also known as ''pattern matching'', ''[[pattern recognition]]'', or ''[[digital image correlation|image correlation]]''. It relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. The technique works best with typewritten text and performs poorly when it encounters new fonts. Early physical photocell-based OCR implemented this technique rather directly.
* ''Feature extraction'' decomposes glyphs into "features" such as lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes recognition computationally efficient. The features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes.
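
As an illustration of matrix matching, the sketch below scores an isolated glyph against stored templates pixel by pixel and returns a ranked list of candidate characters. The 5×5 templates and the names <code>pixel_agreement</code> and <code>rank_candidates</code> are hypothetical and chosen for this example only; it assumes the input glyph has already been isolated and scaled to the template size, and it is not taken from any particular OCR engine.

```python
# Hypothetical 5x5 binary glyph templates ('#' = ink, '.' = background).
# A real system would store glyphs for each supported font and size.
TEMPLATES = {
    "I": ("..#..",
          "..#..",
          "..#..",
          "..#..",
          "..#.."),
    "L": ("#....",
          "#....",
          "#....",
          "#....",
          "#####"),
    "T": ("#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."),
}


def pixel_agreement(glyph, template):
    """Return the fraction of pixel positions where the two glyphs agree."""
    if len(glyph) != len(template) or any(
        len(g_row) != len(t_row) for g_row, t_row in zip(glyph, template)
    ):
        raise ValueError("glyph and template must have the same size")
    total = sum(len(row) for row in glyph)
    same = sum(
        g_px == t_px
        for g_row, t_row in zip(glyph, template)
        for g_px, t_px in zip(g_row, t_row)
    )
    return same / total


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (character, score) pairs, best match first."""
    scored = [(ch, pixel_agreement(glyph, tpl)) for ch, tpl in templates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A "T" whose lowest stem pixel was lost in scanning.
noisy_t = ("#####",
           "..#..",
           "..#..",
           "..#..",
           ".....")

ranking = rank_candidates(noisy_t)
print(ranking)
```

Here "T" ranks first because 24 of its 25 pixels agree with the input. The same code also shows the weakness noted above: a glyph in a different font or at a different scale would no longer line up with any template.
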
General techniques of [[Feature detection (computer vision)|feature detection in computer vision]] are applicable to feature-extraction OCR, which is commonly seen in "intelligent" [[handwriting recognition]] and most modern OCR software.<ref name="ocrwizard">{{cite web|title=How OCR Software Works|url=http://ocrwizard.com/ocr-software/how-ocr-software-works.html|url-status=dead|archive-url=https://web.archive.org/web/20090816210246/http://ocrwizard.com/ocr-software/how-ocr-software-works.html|archive-date=August 16, 2009|access-date=2013-06-16|publisher=OCRWizard}}</ref> [[Nearest neighbour classifiers]] such as the [[k-nearest neighbors algorithm]] are used to compare image features with stored glyph features and choose the nearest match.<ref>{{cite web|url=http://blog.damiles.com/2008/11/14/the-basic-patter-recognition-and-classification-with-opencv.html |title=The basic pattern recognition and classification with openCV | Damiles |publisher=Blog.damiles.com |access-date=2013-06-16|date=2008-11-14 }}</ref>

Software such as [[CuneiForm (software)|Cuneiform]] and [[Tesseract (software)|Tesseract]] uses a two-pass approach to character recognition. The second pass, known as adaptive recognition, uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters. This is advantageous for unusual fonts or for low-quality scans where the font is distorted (e.g. blurred or faded).<ref name="Tesseract overview">{{cite web|author=Smith, Ray |year=2007|title=An Overview of the Tesseract OCR Engine|url=http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|url-status=dead|archive-url=https://web.archive.org/web/20100928052954/http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf|archive-date=September 28, 2010|access-date=2013-05-23}}</ref>

{{As of|2016|12}}, modern OCR software includes [[Google Docs]] OCR, [[ABBYY FineReader]], and Transym.<ref>{{Cite journal|last=Assefi|first=Mehdi|date=December 2016|title=OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym|url=https://www.researchgate.net/publication/310645810|journal=ResearchGate}}</ref>{{update inline|date=June 2023}} Others, such as [[OCRopus]] and Tesseract, use [[Artificial neural network|neural networks]] that are trained to recognize whole lines of text instead of focusing on single characters.

A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR is then performed on each section individually, using variable character confidence level thresholds to maximize page-level OCR accuracy. A United States patent has been issued for this method.<ref>{{Cite web|title=How the Best OCR Technology Captures 99.91% of Data|url=https://www.bisok.com/grooper-data-capture-method-features/multi-pass-ocr/|access-date=2021-05-27|website=www.bisok.com}}</ref>

The OCR result can be stored in the standardized [[ALTO (XML)|ALTO]] format, a dedicated [[XML schema]] maintained by the United States [[Library of Congress]]. Other common formats include [[hOCR]] and [[Page Analysis and Ground Truth Elements|PAGE XML]].

For a list of optical character recognition software, see [[Comparison of optical character recognition software]].
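
An OCR result stored as ALTO can be read back with an ordinary XML parser. The sketch below uses Python's standard <code>xml.etree.ElementTree</code> module to recover the text of each line and to flag words with a low word-confidence (<code>WC</code>) value. It assumes the ALTO version 4 namespace. The sample document is hypothetical and heavily trimmed (it omits position attributes and other elements a real ALTO file would carry), and the 0.5 threshold and the function names are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO version 4 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, heavily trimmed ALTO document for illustration only;
# real files also carry coordinates, page metadata, and style information.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Optical" WC="0.98"/>
            <SP/>
            <String CONTENT="charactcr" WC="0.41"/>
            <SP/>
            <String CONTENT="recognition" WC="0.95"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml):
    """Return each TextLine as a list of (word, confidence) pairs."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = []
        for string_el in text_line.iterfind("alto:String", ALTO_NS):
            wc = string_el.get("WC")  # word confidence is optional in ALTO
            confidence = float(wc) if wc is not None else None
            words.append((string_el.get("CONTENT", ""), confidence))
        lines.append(words)
    return lines


def low_confidence_words(lines, threshold=0.5):
    """Return words whose recorded confidence is below the threshold."""
    return [
        word
        for line in lines
        for word, confidence in line
        if confidence is not None and confidence < threshold
    ]


lines = read_lines(SAMPLE_ALTO)
text = "\n".join(" ".join(word for word, _ in line) for line in lines)
suspect = low_confidence_words(lines)
print(text)
print(suspect)
```

Words flagged in this way are natural candidates for manual review or for a second recognition pass.
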