OCR under Linux

Beyond the Basics

Article from Issue 184/2016

Linux OCR software lags behind proprietary applications. We describe some ways to get better results.

Optical character recognition (OCR) is the extraction of text from images. Users often expect OCR to be as straightforward as photocopying, but that is true only in the simplest of cases. More often, OCR is a painstaking process of trial and error, especially with free software OCR, which lags far behind the leading proprietary applications.

The reasons OCR is so labor intensive become obvious when you stop to think about the numbers. At first, an OCR application with 98 to 99 percent accuracy sounds reliable, but, assuming 300 words per page, that still means an average of three to six errors per page. With a complex layout that includes columns and graphics, the number of errors can easily rise to more than 10 per page [1].
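The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: it assumes that accuracy is measured per word and that errors are spread evenly, and the function name `expected_errors_per_page` is made up for this example.

```python
def expected_errors_per_page(accuracy: float, words_per_page: int = 300) -> float:
    """Return the expected number of misread words on one page.

    Assumption: `accuracy` is the fraction of words read correctly
    (0.98 means 98 percent), applied uniformly across the page.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * words_per_page


if __name__ == "__main__":
    # 300 words per page is the figure used in the text.
    for accuracy in (0.98, 0.99):
        errors = expected_errors_per_page(accuracy)
        print(f"{accuracy:.0%} accuracy -> about {errors:.0f} errors per page")
```

At 98 percent the result is about six errors per page, and at 99 percent about three, which matches the range quoted above.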

To make matters worse, characters like the number one (1) and the lowercase L (l), or the upper- or lowercase O (O, o) and zero (0), can be difficult to distinguish. Other characters, such as the ampersand and question mark, can have a bewildering range of shapes (Figure 1). In some cases, too, a short descender (the part of a letter below the baseline) might cause a "y" to be read as a "v" instead. Similarly, a "d" might be read as an "a" if its ascender (the part of a letter above the x-height, or medium height of letters) is short.
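One way to make such confusions easier to find when proofreading is to flag words in the OCR output that mix letters and digits and contain a look-alike character. The sketch below is a simple heuristic written for this illustration, not a feature of any particular OCR program. The character set in `CONFUSABLE` and the function name `flag_suspect_tokens` are assumptions, and the check will also flag legitimate mixed tokens such as model numbers.

```python
# Hypothetical set of look-alike characters taken from the examples above:
# one / lowercase L / uppercase I, and zero / upper- or lowercase O.
CONFUSABLE = set("1lI0Oo")

# Punctuation to strip from the edges of each word before checking it.
EDGE_PUNCTUATION = ".,;:!?\"'()[]"


def flag_suspect_tokens(text: str) -> list[str]:
    """Return words that mix letters and digits and contain a look-alike.

    A word such as "t0tal" is more likely an OCR misreading than "total"
    or "100", so only mixed words are reported for manual review.
    """
    suspects = []
    for raw in text.split():
        token = raw.strip(EDGE_PUNCTUATION)
        has_digit = any(ch.isdigit() for ch in token)
        has_alpha = any(ch.isalpha() for ch in token)
        has_lookalike = any(ch in CONFUSABLE for ch in token)
        if has_digit and has_alpha and has_lookalike:
            suspects.append(token)
    return suspects


if __name__ == "__main__":
    sample = "The t0tal is 100 do11ars, not l00."
    print(flag_suspect_tokens(sample))
```

A list like this only narrows down where to look. Misreadings that produce a valid word, such as "v" for "y", still need a spell checker or a human reader.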

[...]
