OCR under Linux
Beyond the Basics
Linux OCR software lags behind proprietary applications. We describe some ways to get better results.
Optical character recognition (OCR) is the extraction of text from images. Users often expect OCR to be as straightforward and easy as photocopying, but that is generally true only in the simplest of cases. More often, OCR is a painstakingly slow series of trials and errors, and that is especially true in free software OCR, which lags far behind the leading proprietary applications.
The reasons that OCR is so labor intensive are obvious when you stop to think. At first, an OCR application with more than 98 percent accuracy sounds reliable, but, assuming 300 words per page, that means an average of three to six errors per page. With a complex layout that includes columns and graphics, the number of errors can easily rise to more than 10 per page [1].
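The arithmetic behind that estimate can be sketched in a few lines of shell. The function name errors_per_page is hypothetical, and the sketch assumes that every misread word counts as exactly one error, which is a simplification:

```sh
# Estimate OCR errors per page from a word count and an accuracy percentage.
# Assumption: each misread word counts as one error.
errors_per_page() {
    words=$1
    accuracy=$2
    awk -v w="$words" -v a="$accuracy" \
        'BEGIN { printf "%.0f\n", w * (100 - a) / 100 }'
}

# Usage:
#   errors_per_page 300 98
#   errors_per_page 300 99
```

With 300 words per page, 98 percent accuracy gives six errors and 99 percent gives three, which is the range quoted above.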
To make matters worse, characters like the number one (1) and the lowercase L (l) or the upper or lowercase O (o) and zero (0) can be difficult to distinguish. Other characters, such as the ampersand and question mark, can have a bewildering range of shapes (Figure 1). In some cases, too, short descenders (the part of a letter below the baseline) might cause a "y" to be read as a "v" instead. Similarly, a "d" might be read as an "a" if the ascenders (the part of the letter above the x-height or medium height of letters) are short.
In fact, even if the application reads the character set, a font with thin lines or one that has been manually kerned or has anything except a horizontal baseline can be difficult to interpret. The darkness of letters and their background can also affect the success of OCR.
In the case of free software, such difficulties are compounded by a relative lack of attention to OCR. Projects like GOCR [2] or Ocrad [3] proceed so slowly that at times they appear to be inactive. Today, most OCR under Linux depends on Tesseract [4] or CuneiForm [5]. The accuracy of both is roughly equivalent for blocks of text (Figure 2), but CuneiForm tends to be less accurate on highly formatted text (Figure 3), and some users may prefer to avoid CuneiForm because its code is only partially released under a free license. Other OCR applications exist, such as YAGF [6], but they are only front ends for Tesseract or CuneiForm. For better or worse, free software OCR remains primarily at the command line.
Working with Tesseract
Tesseract was first developed by Hewlett Packard from 1985 to 1996. Little work was done on it for a decade, until the code was housed by Google in 2006. It is now housed on GitHub. Tesseract generally installs with an English language pack, but you can also download almost 50 other languages. In fact, much of the recent development work on Tesseract seems to consist of adding languages.
I keep hearing rumors that Tesseract supports multiple graphics formats. However, the versions available in Debian support only .tif images. If you are extracting text from another format, first convert it with the ImageMagick convert utility, which is installed in many distributions by default.
To use the convert utility, enter the original file name and a name for the output file. For example:
convert ORIGINAL OUTPUT
When you have a .tif image, text extraction can also be straightforward. Give only the base name of the output file, because Tesseract appends .txt itself:
tesseract FILE.tif OUTPUT
The output is produced with no indication of progress except a return to the command prompt when the process is complete. Output is to plain text, making Tesseract a salvage tool, rather than a means to reproduce the original format.
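The two steps can be chained in a small shell function. This is only a sketch: ocr_image is a hypothetical helper name, and it assumes that both convert and tesseract are on the PATH and that the input file name has an extension:

```sh
# Convert an image to TIFF with ImageMagick, then extract its text.
# ocr_image is a hypothetical helper, not part of either tool.
ocr_image() {
    input=$1                 # e.g., scan.png
    base=${input%.*}         # strip the extension: scan
    convert "$input" "$base.tif" || return 1
    # Tesseract adds the .txt extension to the output base name itself
    tesseract "$base.tif" "$base" || return 1
    echo "$base.txt"         # report where the text was written
}

# Usage: ocr_image scan.png
```

The function prints the name of the text file, so the result can be passed straight to an editor or to another command.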
However, you can also add a few options to the basic command. With -l LANGUAGE, you can specify a language other than English, using the abbreviations given in the man page. Multiple languages can be listed if necessary, joined with plus signs (e.g., -l eng+fra).
Another useful option is --psm NUMBER (-psm NUMBER in Tesseract 3.x releases), which sets how Tesseract operates, as shown in Table 1. Depending on the image, you might want to try one of these options in the hopes of getting more accurate results.
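Table 1
Tesseract Options (page segmentation modes from the tesseract man page)
| Mode | Meaning |
| 0 | Orientation and script detection (OSD) only |
| 1 | Automatic page segmentation with OSD |
| 2 | Automatic page segmentation, but no OSD or OCR |
| 3 | Fully automatic page segmentation, but no OSD (default) |
| 4 | Assume a single column of text of variable sizes |
| 5 | Assume a single uniform block of vertically aligned text |
| 6 | Assume a single uniform block of text |
| 7 | Treat the image as a single text line |
| 8 | Treat the image as a single word |
| 9 | Treat the image as a single word in a circle |
| 10 | Treat the image as a single character |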
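Because the best mode depends on the image, one practical approach is to run Tesseract once per candidate mode and compare the output files by eye. The following is a minimal sketch: try_psm_modes and the out-psmN naming scheme are hypothetical, and the PSM_FLAG variable is an assumption that lets you switch between --psm and the older -psm spelling:

```sh
# Run Tesseract once per page segmentation mode so results can be compared.
# try_psm_modes is a hypothetical helper. Set PSM_FLAG=-psm for Tesseract 3.x.
PSM_FLAG=${PSM_FLAG:-"--psm"}

try_psm_modes() {
    image=$1
    lang=$2
    shift 2                  # the remaining arguments are the modes to try
    for mode in "$@"; do
        # Writes out-psm3.txt, out-psm4.txt, and so on
        tesseract "$image" "out-psm$mode" -l "$lang" "$PSM_FLAG" "$mode" || return 1
    done
}

# Usage: try_psm_modes scan.tif eng 3 4 6
```

Each run writes to its own file, so no result overwrites another.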
Tesseract also supports the option -c configvar=VALUE, which can be added multiple times to use multiple options. However, the only list of configuration variables I have been able to find is a partial one from an outdated Google page [7]; most of the variables are for Japanese, none of which are likely to improve accuracy for English. Perhaps the option is primarily for future development, but, for now, Tesseract either works or it doesn't. If it doesn't, --psm NUMBER is the only tool within Tesseract itself that might improve accuracy.
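Purely as an illustration of the syntax, the following sketch repeats -c twice. The helper name ocr_digits is hypothetical, and the two variables, tessedit_char_whitelist and preserve_interword_spaces, are examples only: whether they have any effect depends on the Tesseract version and engine, so treat them as assumptions rather than recommendations:

```sh
# Pass several configuration variables by repeating -c.
# ocr_digits is a hypothetical helper. The variables below are examples,
# and they may be ignored by some Tesseract releases.
ocr_digits() {
    image=$1
    outbase=$2               # Tesseract appends .txt to this name
    tesseract "$image" "$outbase" \
        -c tessedit_char_whitelist=0123456789 \
        -c preserve_interword_spaces=1
}

# Usage: ocr_digits meter.tif meter
```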
Working with CuneiForm
CuneiForm is a mixture of freeware and software released under a BSD license. For this reason, in Debian and many of its derivatives, CuneiForm is classified as non-free and will not appear in your list of available packages unless the non-free section of the repositories is enabled.
CuneiForm's basic command structure is even more straightforward than Tesseract's:
cuneiform FILE
However, CuneiForm has several advantages. To start, CuneiForm supports most common graphics formats, so in most cases you have no need to convert the original file. Unless you specify an output file, it writes to cuneiform-out.EXTENSION, although with -o OUTPUT, you can give the output a different name. Its default output, like Tesseract's, is plain text, but you can also complete the -f FORMAT option with html or rtf. For simple text, you may also be able to improve CuneiForm's accuracy for articles, essays, and many other genres with --singlecolumn.
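Put together, these options might be combined as in the following sketch. The helper name ocr_cuneiform is hypothetical, and naming the output after the input file is simply one possible convention:

```sh
# Run CuneiForm with an explicit output format and file name.
# ocr_cuneiform is a hypothetical helper name.
ocr_cuneiform() {
    input=$1                 # any supported image, e.g., essay.png
    format=$2                # html or rtf
    # Name the result after the input: essay.png becomes essay.html
    cuneiform -f "$format" -o "${input%.*}.$format" --singlecolumn "$input"
}

# Usage: ocr_cuneiform essay.png html
```

Drop --singlecolumn from the command if the page is laid out in columns.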
For non-English speakers, CuneiForm's main disadvantage is that it supports only half of the languages that Tesseract does. For all users, CuneiForm may also have the disadvantage of being unstable. In my experience, it has an alarming tendency to end in segmentation faults.
Improving OCR Accuracy
CuneiForm includes options for --dotmatrix and --fax, both of which can sometimes help it read other text that is fragmented or faint. Otherwise, with both CuneiForm and Tesseract, efforts to increase accuracy require editing the original graphic or, safer still, a copy of the original.
Using ImageMagick's convert utility or an editor like GIMP, you can sometimes get better results by:
- Increasing the contrast
- Changing the background color
- Reducing a complex background to a single color
- Converting the image to grayscale
- Increasing the size of the image
- Increasing the resolution (dpi)
Of all these edits, increasing the resolution generally has the best results. That is especially true if the image is a screenshot, which is rarely more than 120dpi, and may be 96dpi or lower. Greatly increasing the resolution – sometimes as high as 5000dpi – can often be effective, although with large images, such resolutions can seriously slow or even prevent the handling of the file.
You can also try different combinations of these edits, depending on the circumstances.
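Several of these edits can be applied in a single convert call. The following is a sketch under stated assumptions: prepare_for_ocr is a hypothetical helper, and the 300% scale and 300dpi values are arbitrary starting points, not tested recommendations:

```sh
# Apply several OCR-friendly edits to a copy of the image in one pass.
# prepare_for_ocr is a hypothetical helper. The numeric values are
# starting points to experiment with.
prepare_for_ocr() {
    input=$1
    output=$2                # written as a new file, the original is untouched
    # -colorspace Gray: convert to grayscale
    # -resize 300%:     increase the size of the image
    # -normalize:       stretch the contrast
    # -density 300:     record 300dpi in the output file
    convert "$input" \
        -colorspace Gray \
        -resize 300% \
        -normalize \
        -units PixelsPerInch -density 300 \
        "$output"
}

# Usage: prepare_for_ocr screenshot.png screenshot.tif
```

Because the function always writes to a second file, you can rerun it with different values and compare the OCR results for each variant.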