A modern diff utility
Command Line – diffoscope
With support for more than 60 file formats, diffoscope extends the power of diff beyond the plain text or HTML file.
The first command in Unix-like systems for comparing files and directories was diff. Originally written by Douglas McIlroy and first appearing in Unix 5th Edition in 1974, diff rapidly became an essential programming tool. Today, the original command is still available, and most programming languages have their own versions of diff. However, diff and its derivatives generally have one limitation: With few exceptions, they work only with plain text or markup languages like HTML. A new variation called diffoscope [1], which was released in mid-2020, brings a new level of functionality to file comparison.
Diffoscope is developed primarily by Debian's Reproducible Builds project [2], which aims to increase the robustness and security of Debian packages by ensuring that they always build the same way. Given Debian's nearly 60,000 packages and the variety of hardware available, this is no small task, especially considering that small errors in code can be hard to trace. Diffoscope was written to make this task easier by quickly tracking down differences between two files that are supposed to be identical but perform differently. As a side effect, diffoscope provides a modern diff utility that works across most programming languages and brings the power of diff to desktop users and non-programmers, especially writers who wish to compare drafts. Already, diffoscope supports over 60 formats that range from archives and filesystems to audio and text files, including MS Word, LibreOffice Writer, and PDF (Table 1). More seem likely to follow.
Table 1
Supported Formats
Android APK files | LLVM IR bitcode files
Android boot images | LZ4 compressed files
ar(1) archives | macOS binaries
Berkeley DB database files | Microsoft Windows icon files
bzip2 archives | Microsoft Word .docx files
Character/block devices | Mono Portable Executable files
ColorSync color profiles (.icc) | Multimedia metadata
coreboot CBFS filesystem images | OCaml interface files
cpio archives | Ogg Vorbis audio files
Dalvik .dex files | OpenOffice/LibreOffice .odt files
Directories | OpenSSH public keys
Debian buildinfo files | OpenWRT package archives (.ipk)
Debian .changes files | PDF documents
Debian source packages (.dsc) | PGP signatures
Device Tree Compiler blob files | PGP signed/encrypted messages
ELF binaries | PNG images
ext2/ext3/ext4/Btrfs/FAT filesystems | PostScript documents
freedesktop.org fontconfig cache files | RPM archives
Free Pascal files (.ppu) | Rust object files (.deflate)
gettext message catalogs | SQLite databases
GHC Haskell .hi files | SquashFS filesystems
GIF image files | Statically linked binaries
Git repositories | Symlinks
GNU R database files (.rdb) | Tape archives (.tar)
GNU R Rscript files (.rds) | tcpdump capture files (.pcap)
Gnumeric spreadsheets | Text files
Gzipped files | TrueType font files
ISO 9660 CD images | WebAssembly binary module
Java .class files | XML binary schemas (.xsb)
JavaScript files | XML files
JPEG images | XZ compressed files
JSON files |
Diffoscope's basic command structure is:
diffoscope FILE1 FILE2
If only one file or directory is given, then diffoscope attempts to compare the given file with the last file compared, a desperate act that will only occasionally be useful. For convenience, the output can be piped through less or more. You might also add the --progress option for large files like DVD images. If you are dealing with large files, you might also run up against the built-in limits for output. Rather than resetting them one by one, you can cancel all of them with the option --no-default-limits.
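As a minimal sketch of the large-file case described above, the two options can be wrapped in a small shell function. The function name big_compare and the image names in the comment are made up for illustration; only --progress and --no-default-limits come from the text.

```shell
# big_compare OLD NEW
# Compare two large files with a progress indicator and with all of
# diffoscope's built-in output limits switched off.
big_compare() {
    diffoscope --progress --no-default-limits "$1" "$2"
}

# Example call with hypothetical image names, paged through less:
#   big_compare old.iso new.iso | less
```

Like diff, diffoscope exits with a non-zero status when the two inputs differ, so a wrapper like this can also be used in a script's if test.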
Diffoscope writes to standard output by default, but you can also save the report to a file. The output shows the content of the first file in red text, with each line prefaced by a minus sign, and the content of the second file in white text prefaced by a plus sign. At the top of the output, you'll find statistics that vary with the file type. For example, in Figure 1, the files share LibreOffice's .odt format, and the statistics are the file names, the amount of text in each file that differs, and the number of total words in each file. By contrast, in Figure 2, a directory diff is prefaced by file listings, file permissions, and other attributes. The output is driven by context, ensuring that it is useful for more than the diff itself.
Output Formatting Options
Besides standard output, diffoscope's output can be saved to several file formats. To write output to a text file, add the option --text OUTPUT-FILE, giving the full path. You can also color-code text output with --text-color WHEN, replacing WHEN with never, auto, or always. Color is enabled automatically in standard output, but disabled by default when you write to a file. Similarly, an HTML file is named with --html OUTPUT-FILE. Color is not supported for HTML files, but you can write a multi-file HTML report using --html-dir OUTPUT-DIRECTORY, so you can absorb the output in small chunks, and --css URL to format the output as desired. If you are using JavaScript, both text and HTML output can be formatted using --jquery URL. Other supported file format options are --json OUTPUT-FILE, --markdown OUTPUT-FILE, and --restructured-text OUTPUT-FILE, all three of which can be used for either files or for standard output. In all these formats, --output-empty can be used to write a report even when there are no differences.
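The sketch below saves one comparison in several of the formats just listed. It assumes that diffoscope accepts more than one output option in a single run; the function name write_reports and the report file names are illustrative choices, not anything diffoscope prescribes.

```shell
# write_reports OLD NEW OUTDIR
# Save the same comparison as text, HTML, JSON, and Markdown reports
# inside OUTDIR (created if it does not exist yet).
write_reports() {
    mkdir -p "$3"
    diffoscope \
        --text "$3/report.txt" \
        --html "$3/report.html" \
        --json "$3/report.json" \
        --markdown "$3/report.md" \
        "$1" "$2"
}

# Example call with hypothetical draft names:
#   write_reports draft-v1.odt draft-v2.odt ./reports
```

If the multi-format assumption does not hold for your version, running diffoscope once per format gives the same set of files at the cost of repeating the comparison.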
Output Limit Options
Coming from an era of memory limitations, diff is economical, by default writing just a few lines so that the context of a difference can be read. By contrast, diffoscope, written in mid-2020, has limits that are so high that, for all practical purposes, it often has none. If you want to limit diffoscope's output, perhaps to make it more manageable, you have to add limits deliberately. The number of bytes in a text report is unlimited by default, but you can use --max-text-report-size BYTES to define a limit. Alternatively, you can use --max-page-size BYTES to change the default of 409,600 for an HTML page, or, if using --html-dir OUTPUT-DIRECTORY, you can use --max-page-size-child BYTES to change the size of the separate pages of an HTML report from the default of 204,800. Still another alternative is to use --max-diff-block-lines LINES to change the default 1,024 lines for a unified-diff block, that is, for the separate chunks of the report. These options are primarily for comparisons of long files, such as .iso images, and are generally irrelevant when dealing with files in MS Word or LibreOffice format unless you are comparing complete manuscripts.
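To illustrate adding limits deliberately, this sketch caps a text report at 1MB and each diff block at 200 lines. Both numbers and the function name capped_compare are arbitrary examples, and --max-diff-block-lines is assumed here to be the option behind the 1,024-line default mentioned above.

```shell
# capped_compare OLD NEW REPORT
# Write a text report to REPORT, capped at 1,048,576 bytes overall and
# at 200 lines per unified-diff block (both values are examples).
capped_compare() {
    diffoscope \
        --max-text-report-size 1048576 \
        --max-diff-block-lines 200 \
        --text "$3" \
        "$1" "$2"
}

# Example call with hypothetical file names:
#   capped_compare old.iso new.iso iso-diff.txt
```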
Difference Calculation Options
A number of options modify how diffoscope makes its comparisons. --exclude GLOB_PATTERN skips files whose names match the pattern, while --exclude-command REGEX_PATTERN skips the external commands that match; both can be used with either files or directories. When working with directories, you can set whether permissions and other file attributes are used with --exclude-directory-metadata SETTING, which can be completed with auto, yes, no, or recursive. In addition, you can opt to enable fuzzy logic with --fuzzy-threshold, controlling how minor differences are handled. A setting of 0 means that all matches must be exact; however, the meaning of the default of 60 or the maximum of 400 has to be discovered through trial and error, since it is currently undocumented.
Other options are reminiscent of diff itself, setting the number of lines to compare. Use --max-diff-input-lines LINES to cap the number of lines that are compared (the default is 4,194,304). You can also set the maximum number of lines saved per diff block with --max-diff-block-lines-saved LINES.
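The comparison options above can be combined when comparing two directory trees. In this sketch the '*.log' pattern and the function name compare_trees are illustrative; --fuzzy-threshold is diffoscope's option for the fuzzy-matching setting, and 0 turns fuzzy matching off so that only exact matches are paired.

```shell
# compare_trees DIR1 DIR2
# Compare two directory trees while skipping log files, ignoring
# permissions and other directory metadata, and requiring exact matches.
compare_trees() {
    diffoscope \
        --exclude '*.log' \
        --exclude-directory-metadata yes \
        --fuzzy-threshold 0 \
        "$1" "$2"
}

# Example call with hypothetical build directories:
#   compare_trees build-a/ build-b/
```

The glob is single-quoted so that the shell passes it to diffoscope unchanged instead of expanding it against the current directory.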