Paperwork battles the increasing stacks of paper

Paperless

Article from Issue 166/2014

Author(s): Karsten Günther

Paperwork was developed to manage the paperless office – a dream as old as desktop PCs.

The idea behind Paperwork [1] harks back to the dream of the paperless office: You scan incoming correspondence, invoices, and loose sheets then run them through an optical character recognition (OCR) tool that converts the content into digital form. An application then merges the image data and text in a superimposed form and saves it as a PDF.

Certain pitfalls await, however: For sufficiently good OCR, you need the highest quality scans or photographs of the text pages that you can get. A good scanner with at least 600dpi resolution is preferred (although 300dpi will work in some cases), and the OCR software needs to be the best fit for the job at hand. When Paperwork launches, it first searches for Tesseract [2]. If it cannot find this very powerful OCR engine, it falls back to Cuneiform. In most cases, Tesseract will give better results.
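The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not Paperwork's actual code: the executable names tesseract and cuneiform are assumptions about a typical Linux installation, and pick_ocr_engine is a hypothetical helper.

```python
import shutil

# Preference order described in the article: Tesseract first, then Cuneiform.
# The executable names are assumptions about a typical Linux install.
OCR_CANDIDATES = ("tesseract", "cuneiform")


def pick_ocr_engine(candidates=OCR_CANDIDATES, which=shutil.which):
    """Return (name, path) of the first OCR executable found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    engine = pick_ocr_engine()
    if engine is None:
        print("No OCR engine found; install Tesseract or Cuneiform.")
    else:
        print("Using %s at %s" % engine)
```

Running the script before you install Paperwork shows which engine, if any, is already available on your system.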

Getting Started

On Arch Linux, you can install Paperwork easily from the AUR. On Ubuntu, you will not currently find Paperwork in the repositories, and there is no PPA. Your best bet is to follow the installation manual [3].

Paperwork is essentially based on four components. To scan the documents, Paperwork draws on Sane. Character recognition is handled by Tesseract or Cuneiform. Whoosh [4] indexes the OCR-converted texts so they can be searched easily, and the tool automatically generates suggestions for keywords. Paperwork then merges the whole enchilada into a graphical interface developed with Gtk/Glade.
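To give a feel for what the indexing component contributes, the following sketch builds a tiny inverted index with word suggestions. It is a deliberately simplified stand-in written with the Python standard library only; it does not use Whoosh's API, and the TinyIndex class and its methods are hypothetical names chosen for illustration.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index: maps each word to the documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> set of document ids
        self._counts = Counter()           # word -> total occurrences

    def add(self, doc_id, text):
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)
            self._counts[word] += 1

    def search(self, query):
        """Return ids of documents that contain every word of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(self._postings.get(words[0], ()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result

    def suggest(self, prefix, limit=5):
        """Suggest indexed words starting with prefix, most frequent first."""
        prefix = prefix.lower()
        matches = [w for w in self._counts if w.startswith(prefix)]
        return sorted(matches, key=lambda w: (-self._counts[w], w))[:limit]
```

A real search library adds ranking, stemming, and persistent storage on top of this basic idea.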

The preferred Tesseract OCR engine originally came from Hewlett-Packard. Google uses the open source library, for example, to digitize books [5]. The software stands out for its high recognition rate and high level of automation. The drawback: Tesseract exclusively processes uncompressed TIFF input files, so you need to convert documents where necessary.

The Paperless Office

On launch, Paperwork comes up with a clearly designed interface comprising three sections. On the left, you see the current document; next to that are the existing, scanned, and edited pages; on the right is the current page in detail. Like the gscan2pdf PDF scanner [6], Paperwork retrieves documents directly from a connected scanner or loads existing images from the hard disk.

The software merges scanned images to form projects and then exports the projects as PDF files. By default, Paperwork stores the projects in the papers folder in subdirectories named after the current date (e.g., 20140605_1350_31/). It creates several files in these directories: paper.<number>.jpg contains the JPEG image of the scanned page, and paper.<number>.words contains the text extracted by the OCR engine.
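If you want to inspect this layout from a script, a minimal sketch might look like the following. It relies only on the naming scheme described above; the helper names parse_doc_timestamp and list_pages are hypothetical, and the reading of the directory name as date, hour and minute, and seconds is an assumption based on the example.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches paper.<number>.jpg and paper.<number>.words, but not thumbnails.
PAGE_FILE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def parse_doc_timestamp(dirname):
    """Parse a directory name such as '20140605_1350_31' into a datetime."""
    return datetime.strptime(dirname, "%Y%m%d_%H%M_%S")


def list_pages(doc_dir):
    """Map page number -> {'jpg': Path, 'words': Path} for one document."""
    pages = {}
    for entry in Path(doc_dir).iterdir():
        match = PAGE_FILE.match(entry.name)
        if match:
            number, kind = int(match.group(1)), match.group(2)
            pages.setdefault(number, {})[kind] = entry
    return dict(sorted(pages.items()))
```

A page that has a .jpg entry but no .words entry is one for which no OCR result has been stored.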

The .words files are not plain text, however, but special XML files in hOCR format [7] that record each word's position in the original document in addition to the text itself. These files are not easy to read in a text editor, but the position data lets you superimpose the extracted text precisely on the image file. The DjVu document format [8], which was specially developed for scanned documents, is based on this design.
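Because hOCR is ordinary markup, a short script can pull the words and their positions out of a .words file. The sketch below assumes that each word is a span with the class ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1", as in typical Tesseract hOCR output; whether Paperwork's files match this exactly is an assumption, and the sample data is invented.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes word spans contain no nested spans.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Invented sample in the assumed format.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 110 50'>Invoice</span>
<span class='ocrx_word' title='bbox 120 20 300 50'>2014</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The bounding boxes are pixel coordinates in the scanned image, which is what makes it possible to place the text exactly over the picture.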

Paperwork also stores preview images of the scanned pages in the directory. You can identify them by their thumb name component. Files with labels in their names store manually assigned labels for the document; a file stored as extra.txt additionally contains the keywords you assign.

Paperwork supports multiple sources for loading documents. The application can drive a scanner directly, and it automatically tries to find the scanner via the Sane back end. Alternatively, Paperwork supports USB-connected webcams, which is usually not a good solution given their typically low resolution and poor quality. Paperwork also accepts images created in any other way as a source, such as screenshots of PDFs; however, the lack of image quality means the OCR engine rarely delivers useful results in these cases.

Additionally, Paperwork lets you edit PDF files directly. You can load these by selecting Document | Import file(s). If necessary, Paperwork will import several PDFs in one fell swoop – but not recursively from subdirectories. Thus, you need to store the data to be imported in a single directory.
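If your PDFs are spread across a directory tree, you can collect them in one flat folder before importing. The following sketch is one possible way to do this with the Python standard library; flatten_pdfs is a hypothetical helper, it copies rather than moves the files, and it matches only the lowercase .pdf extension.

```python
import shutil
from pathlib import Path


def flatten_pdfs(source_root, target_dir):
    """Copy every PDF below source_root into target_dir, avoiding name clashes."""
    source = Path(source_root).resolve()
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the list before any file is copied.
    for pdf in sorted(source.rglob("*.pdf")):
        if target in pdf.parents:
            continue  # already inside the target folder
        dest = target / pdf.name
        counter = 1
        while dest.exists():
            # Same file name in two subdirectories: append a counter.
            dest = target / f"{pdf.stem}_{counter}{pdf.suffix}"
            counter += 1
        shutil.copy2(pdf, dest)
        copied.append(dest)
    return copied
```

You can then point the import dialog at the target folder.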

Setting Up OCR

Before you start scanning documents, you need to set up the program (Figure 1). The icon for Settings is fourth from the left in the toolbar. In addition to configuring the working directory, you also configure the scanner and define the language for text recognition. Paperwork stores the settings in the ~/.config/paperwork.conf file, and it writes the index for all scanned documents to ~/.local/share/paperwork/index/.
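To check what the settings dialog has written, you can dump the configuration file from a script. The sketch below assumes that paperwork.conf uses INI-style syntax that Python's configparser can read; it makes no assumptions about particular section or key names, and load_settings is a hypothetical helper.

```python
import configparser
from pathlib import Path

# Locations named in the article.
CONFIG_PATH = Path.home() / ".config" / "paperwork.conf"
INDEX_DIR = Path.home() / ".local" / "share" / "paperwork" / "index"


def load_settings(text):
    """Parse INI-style text into {section: {key: value}}.

    Assumption: the file is INI-style. configparser lowercases key names.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        for section, values in load_settings(CONFIG_PATH.read_text()).items():
            print("[%s]" % section)
            for key, value in values.items():
                print("  %s = %s" % (key, value))
    else:
        print("No configuration found at", CONFIG_PATH)
    print("Index directory present:", INDEX_DIR.is_dir())
```

The script only reads the file, so it is safe to run while experimenting with the settings.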

Figure 1: The Paperwork configuration is limited to a few settings.

To calibrate the scanner, click the icon on the right in the settings dialog. Paperwork then starts a scan, which it uses as the basis for further input from the device. How well this works depends to some extent on the fonts used.

Figure 2 shows an example in which the Paperwork OCR engine almost completely converted the text despite scanning at an angle. To see the words that were deciphered (in the blue frames), select Document | Advanced | Highlight all words. It is up to you to decide whether the plain text is accurate. In Figure 3, Paperwork tries its hand with a PDF generated by OpenOffice. This actually provides better conditions than a scanned document, but the result shows that many words were not recognized, as you can see from the number of words that lack blue boxes. Often, you can optimize the results by delimiting the area processed by the OCR engine in Document | Edit (Figure 4); however, this means a new, time-consuming OCR run each time you make a change.

Figure 2: Paperwork's OCR achieved good hit rates, even with poorly aligned documents.

Figure 3: Text passages without blue boxes were not identified as text by the Paperwork OCR feature.

Figure 4: You can narrow down the area to be processed in the image to optimize the OCR results.
