Insight Into OCR Technology

What better way is there to start off a blog post than with a definition? So here it is: Optical Character Recognition, often abbreviated as OCR, is the process of capturing the text in a scanned document and converting it into digital text that can be edited and stored.

An OCR Example
Suppose you wanted to digitize a binder full of paper documents. Digitizing them would let you access the documents on your computer or mobile device, add to or redact them, and protect them from water damage, fire or theft.

Now, you could spend hours, possibly days, typing the content of each document into a computer page by page. Or you could scan the binder and, using OCR, have all of its documents digitized into text in minutes.

Utilizing OCR 
The OCR process begins with a document being scanned by an optical scanner. The scanner reads the page as a bitmap, or pattern of dots, and OCR then distinguishes the text from any images on the page. By analyzing the light and dark areas, the background, the stroke edges and the gaps between characters, OCR matches each shape on the page against known characters, makes a best guess as to which letter or digit it is, and converts it into digital text.
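To make the matching step concrete, here is a minimal Python sketch of the idea described above: threshold a tiny grayscale bitmap into dark and light pixels, compare it against a handful of known character templates, and take the best guess. Everything in it (the 5x3 templates, the names binarize, score and recognize, and the threshold of 128) is a hypothetical illustration, not how any particular OCR product works; real engines use far more sophisticated analysis.

```python
# Hypothetical 5x3 templates for a few known characters ('#' = dark, '.' = light).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}


def binarize(gray, threshold=128):
    """Turn grayscale values (0 = black, 255 = white) into dark/light booleans."""
    return [[pixel < threshold for pixel in row] for row in gray]


def score(bitmap, template):
    """Fraction of pixels where the scanned bitmap agrees with the template."""
    total = 0
    matches = 0
    for bitmap_row, template_row in zip(bitmap, template):
        for is_dark, mark in zip(bitmap_row, template_row):
            total += 1
            if is_dark == (mark == "#"):
                matches += 1
    return matches / total


def recognize(gray):
    """Return the best-guess character and its match score."""
    bitmap = binarize(gray)
    best = max(TEMPLATES, key=lambda ch: score(bitmap, TEMPLATES[ch]))
    return best, score(bitmap, TEMPLATES[best])


# A noisy scan of "7": one pixel that should be dark came out light (200).
scanned = [
    [10, 30, 20],
    [250, 240, 15],
    [245, 200, 250],
    [255, 25, 240],
    [250, 35, 245],
]

guess, confidence = recognize(scanned)
print(guess, round(confidence, 2))
```

Even with one pixel misread, the scan still agrees with the "7" template on 14 of its 15 pixels, more than with any other template, so "7" is the best guess. That is the sense in which OCR guesses rather than knows.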

Over the years, advances in OCR have made it more reliable, producing a minimum of 90% accuracy for average-quality documents and 99% or greater accuracy for the cleanest documents.
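The figures above do not say how accuracy is measured. One common convention, assumed here purely for illustration, is character-level accuracy: count the single-character edits needed to turn the OCR output into the correct text, and divide by the length of the correct text. The function names and the sample strings below are made up for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr_output):
    """1 minus the edit distance as a fraction of the true text's length."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr_output) / len(truth))


truth = "Invoice 1024"
ocr_output = "lnvoice 1O24"  # 'I' misread as 'l', '0' misread as 'O'

print(f"{character_accuracy(truth, ocr_output):.1%}")
```

Two wrong characters out of twelve gives roughly 83% on this short string. By the same measure, 99% accuracy still means about one wrong character in every hundred.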

A World Without OCR 
When a page of text is scanned into a computer without OCR software, all the computer sees is a bunch of graphical bits, or an image. In other words, it has no idea that there is text on the page, much less what the text says; the scanned document is simply stored as a digital image. An OCR program, however, can convert the characters on the page into a text document that can be read by a word processing program. More advanced OCR programs can even preserve the document's formatting in the conversion.

Optical Character Recognition is a key part of document imaging, scanning and management. Without OCR, the scanner treats every scanned document as a single image instead of lines of text. When a document is treated as an image, it cannot be edited, only viewed and stored.

Where is OCR Being Used?
OCR is frequently used by libraries and governments to digitize and preserve their books, newspapers, magazines and other paper documents. Companies and enterprises all over the nation use it because it is the quickest and most cost-effective method available for digitizing documents, and each year it frees up tons of storage space that was once home to file cabinets full of paper documents. It is also used to process checks and credit card slips, and every day billions of magazines and letters are sorted by OCR machines, considerably speeding up mail delivery.

