PDF creators, extractors, and editors tested
Mutool
The small command-line tool Mutool [5] ships with the simple PDF viewer MuPDF. Manufacturer Artifex describes it as the "Swiss army knife of PDF manipulation tools." If this is true, something is definitely wrong with Swiss engineering: Mutool can only regenerate a PDF, extract its fonts and images, display some information, and arrange the pages on a giant poster.
Like MuPDF, Mutool is released under the Affero GPL. We looked at version 1.2.2, which can be found in the repositories of Ubuntu 13.10 as the mupdf-tools package.
Mutool uncomplainingly extracted all the images and the associated text from the documents written with InDesign, LibreOffice, and Scribus. Because the tool had no idea what to do with vector graphics, it only returned the font used in the Inkscape PDF, DejaVu Sans.
Mutool stores the fonts in the file format that it finds in the PDF. In InDesign documents, this meant that PostScript fonts in CFF and CID [6] formats were generated. In contrast, the free applications had embedded TrueType fonts. The exception is Scribus, which embeds fonts as PFA files.
Mutool always provides images in PNG format; the tool converts all other image types without asking. Mutool will open encrypted PDF files if the user supplies the password via an extra parameter.
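As a minimal sketch of such a call, the following Python helper assembles the argument list for mutool extract, including the -p password option. The function name, file name, and password are hypothetical placeholders, and the helper only builds the command without running it.

```python
from typing import List, Optional


def mutool_extract_command(pdf_path: str, password: Optional[str] = None) -> List[str]:
    """Build the argument list for `mutool extract`.

    Mutool writes the extracted fonts and images into the current
    working directory, so no output path is passed.
    """
    cmd = ["mutool", "extract"]
    if password is not None:
        # -p hands the password for an encrypted PDF to Mutool
        cmd += ["-p", password]
    cmd.append(pdf_path)
    return cmd


# Hypothetical file name and password, for illustration only
print(mutool_extract_command("article.pdf", password="secret"))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a system where Mutool is installed.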
Poppler Utilities
The pdftotext command-line tool now belongs to the Poppler Toolbox, which in turn was created as a fork of Xpdf [7]. Most distributions provide Poppler in their repositories; on Ubuntu 13.10, pdftotext resides in the poppler-utils package. We tested version 0.24.1.
As its name suggests, pdftotext extracts the text from a PDF document. The results will always require postprocessing: In multicolumn documents, the extraction begins at the top left and stops at the bottom right on a page. The author box in the sample article ended up in the middle of the text, but at least no text was lost.
Using command-line options, you can restrict the analysis to individual pages and rectangular areas. On request, pdftotext will try to keep the layout (Figures 8 and 9). Columns and indentation are simulated with spaces. This makes it possible to read the LibreOffice test document, but spaces hinder further processing.
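One way to deal with the padding spaces during postprocessing is sketched below. The helper is not part of Poppler; it rests on the assumption that two or more consecutive spaces mark a column gap or indentation, whereas single spaces separate ordinary words.

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split one line of layout-preserving output into column fragments.

    Assumption: runs of two or more spaces are layout padding,
    single spaces are normal word separators.
    """
    return [part for part in re.split(r" {2,}", line.strip()) if part]


# Made-up sample line with simulated indentation and a column gap
print(split_columns("    Left column text        Right column text"))
```

This heuristic breaks down wherever the padding between two columns shrinks to a single space, so the results still need checking by hand.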
Although users can also specify the character encoding, many non-standard characters were not recognized in the text generated from the sample documents. Pdftotext had no problems with password protection in the LibreOffice document; users only need to pass in the password with a parameter.
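The options mentioned so far can be combined in a single call. The following sketch builds a pdftotext command line from Python using the -f and -l (page range), -x, -y, -W, and -H (rectangular area), -layout, -enc, and -upw switches. The function name, its parameters, and the file names are hypothetical; the helper only assembles the arguments.

```python
from typing import List, Optional, Tuple


def pdftotext_command(
    pdf_path: str,
    txt_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    region: Optional[Tuple[int, int, int, int]] = None,  # x, y, width, height
    keep_layout: bool = False,
    encoding: Optional[str] = None,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    if region is not None:
        x, y, width, height = region
        # Restrict the extraction to a rectangular crop area
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if keep_layout:
        cmd.append("-layout")  # simulate columns and indentation with spaces
    if encoding is not None:
        cmd += ["-enc", encoding]  # output character encoding
    if user_password is not None:
        cmd += ["-upw", user_password]  # user password of an encrypted PDF
    cmd += [pdf_path, txt_path]
    return cmd


# Hypothetical file names and password, for illustration only
print(pdftotext_command("article.pdf", "article.txt",
                        first_page=1, last_page=2,
                        keep_layout=True, encoding="UTF-8",
                        user_password="secret"))
```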
Fishing for Photos
In addition to pdftotext, the Poppler Tools also include pdfimages, which extracts images, and pdftohtml, which converts a PDF into HTML pages. Pdfimages only extracts bitmap images from the PDF and stores them in PPM format by default. You need to specify the -j switch to have images that are embedded as JPEGs written out as JPEG files. As in pdftotext, you can pass in the password; the tool cannot handle vector graphics.
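A pdfimages call follows the same pattern. The sketch below assembles the arguments with the -j and -upw switches; the second positional argument is the name prefix (image root) that pdfimages uses for the files it writes. The function name and the sample values are hypothetical.

```python
from typing import List, Optional


def pdfimages_command(
    pdf_path: str,
    image_root: str,
    jpeg: bool = False,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a pdfimages call."""
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")  # write JPEG-encoded images as .jpg files
    if user_password is not None:
        cmd += ["-upw", user_password]
    cmd += [pdf_path, image_root]
    return cmd


# Hypothetical file name and prefix, for illustration only
print(pdfimages_command("article.pdf", "img", jpeg=True))
```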
Pdftohtml behaves like a mixture of pdftotext and pdfimages: It extracts the images and dumps the text into one or more HTML files. To avoid a jumble of characters, we explicitly specified character encoding. Pdftohtml will add the links in the PDF document to the HTML results, if so desired.
The extracted text was just as jumbled as with pdftotext, and users cannot force the layout in this case. To compensate, the tool can create a "complex document." Here, the layout and the images are transferred into a large PNG image, onto which the browser then superimposes the text (Figure 10). The result is indeed reminiscent of the original layout, but you can only copy and edit the text. Moreover, pdftohtml totally ignores all vector graphics.
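For pdftohtml, the explicit character encoding and the "complex document" mode correspond to the -enc and -c switches. The following sketch assembles such a call; the function name and the file names are hypothetical, and the helper only builds the command.

```python
from typing import List, Optional


def pdftohtml_command(
    pdf_path: str,
    html_path: str,
    complex_output: bool = False,
    encoding: Optional[str] = None,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a pdftohtml call."""
    cmd = ["pdftohtml"]
    if complex_output:
        cmd.append("-c")  # "complex document": background image plus text overlay
    if encoding is not None:
        cmd += ["-enc", encoding]  # output text encoding
    if user_password is not None:
        cmd += ["-upw", user_password]
    cmd += [pdf_path, html_path]
    return cmd


# Hypothetical file names, for illustration only
print(pdftohtml_command("article.pdf", "article.html",
                        complex_output=True, encoding="UTF-8"))
```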