Avoiding data corruption in backups
Integrity Check
A backup policy can protect your data from malware attacks and system crashes, but first you need to ensure that you are backing up uncorrupted data.
Most home users, and I dare say some system administrators, lack a backup policy. Their family pictures, music collections, and customer data files live on their hard drives and are never backed up to offline storage, only to be lost when the hard drive eventually crashes. A few users know to keep backups and regularly copy their files over to a safe storage medium. But even these conscientious people may find their strategy lacking when the time comes to recover from a system crash and they discover corrupted backup data. A successful backup strategy must involve checking for corrupted data.
Silent Data Corruption
Many of the users who do keep copies of their important files keep only a single backup. Often, they use an external storage system, such as Tarsnap or a Nextcloud instance, periodically or continuously synchronizing the important files on their computers with the cloud. While a comfortable approach for end users, a single backup suffers from a number of problems. Most importantly, single backups are vulnerable to silent data corruption.
Take for example a folder called Foals, which is full of pictures of happy young horses. My backup strategy consists of weekly copying the entire folder over to USB mass storage with a tool such as rsync [1]:
$ rsync -a --delete --checksum Foals/ /path/to/usb/
The rsync tool synchronizes the contents of /path/to/usb with the contents of Foals.
This strategy works until one of the pictures in Foals gets corrupted. Files get damaged for a number of reasons, such as a filesystem failing to recover properly after an unclean shutdown. Files also may be lost because of human error: You intend to delete Foals/10_foal.jpg but end up removing Foals/01_foal.jpg instead without realizing the mistake. If a file gets corrupted or lost and you don't detect the issue before the next backup cycle, rsync will overwrite the good copy in USB storage with bad data. At this point, all the good copies of the data cease to exist, destroyed by the very backup system intended to protect them.
To mitigate this threat, you can establish a long-term storage policy for backups, which involves saving your backup to a different folder each week within the USB mass storage. I could therefore keep a current backup of Foals in a folder called Foals_2022-01-30, an older backup in Foals_2022-01-23, and so on. When the backup storage becomes full, I could just delete the older folders to make room for the newer ones. With this strategy, if data corruption happens and it takes me a week to discover it, I may be able to dig up good copies of the files from an older snapshot (Figure 1). See the boxout "The rsync Time Machine" for instructions on how to set up this multi-week backup system.
The rsync Time Machine
With rsync, you can save backups to a directly attached drive or over a network. As an added convenience, the snapshot of the folder that rsync takes does not take much space on your storage device.
Suppose I have an external drive mounted under /mnt. The first snapshot would be saved with a regular invocation of rsync:
$ mkdir /mnt/Foals_2022-01-23
$ rsync -a Foals/ /mnt/Foals_2022-01-23
The first command creates a directory with a name reflecting the date. The second command copies Foals to the newly created directory. The -a switch instructs rsync to work in archive mode, recursively descending into subdirectories and preserving symlinks, time metadata, file permissions, and file ownership data.
When the time comes to make another weekly backup, I create a different backup folder (which references the new current date) and copy Foals to it. However, rsync has a trick up its sleeve: The --link-dest switch tells rsync to transfer only the changes since the last backup:
$ mkdir /mnt/Foals_2022-01-30
$ rsync -a --link-dest /mnt/Foals_2022-01-23 Foals/ /mnt/Foals_2022-01-30
As a result, rsync copies any new file to the new backup directory, alongside any file that has been modified since the last backup. Files that have been deleted from the source directory are not copied. For files that exist in the source directory but have not been modified since the last backup, rsync creates a hard link to the unmodified files' respective copies in the old backup directory rather than copying them to the new backup directory.
The end result is that Foals_2022-01-23 contains a copy of Foals as it was on that date, while Foals_2022-01-30 contains a current snapshot of Foals. Because only modified or new files are added to the storage medium, they barely take up any extra space. Everything else is included in the new backup folder via hard links.
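If you want to convince yourself that the unchanged files really are hard links rather than full copies, you can compare their inode numbers and link counts. This check is my own addition, reusing a file name from the earlier example:
$ stat -c '%i %h %n' /mnt/Foals_2022-01-23/01_foal.jpg /mnt/Foals_2022-01-30/01_foal.jpg
If both paths report the same inode number and a link count of 2, the picture is stored on the drive only once.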
Unfortunately, long-term storage only works if the data corruption is discovered in time. If your storage medium only has room for storing four snapshots, a particular version of a file will only exist in the backup for four weeks. On the fifth week, the oldest snapshot will be deleted in order to make room for new copies. If the data corruption is not detected within this time window, the good copies of the data will be gone, and you will no longer be able to retrieve them from a backup.
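The rotation step (deleting the oldest snapshots when the drive fills up) can be done by hand, but here is a minimal sketch of how it could be automated, assuming the dated Foals_* folder names from above, which sort chronologically, and a retention of four snapshots:
$ cd /mnt
$ ls -d Foals_*/ | sort | head -n -4 | xargs -r rm -rf
The head -n -4 invocation (GNU head) prints every snapshot name except the last four, and xargs removes them; drop the rm part and run the pipeline first to see which folders would be deleted.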
Solving for Silent Data Corruption
The first step in guaranteeing a good backup is to verify that you are backing up only uncorrupted data, which is easier said than done. Fortunately, a number of tools exist to help you preserve your data integrity.
Filesystems with checksum support (such as ZFS) offer a reasonable degree of protection against corruption derived from hardware errors. A checksum function takes data, such as a message or a file, and generates a string of text from it. As long as the function is passed the same data, it will generate the same string. If the data gets corrupted in the slightest, the generated string will be different.
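To see this behavior for yourself, checksum a throwaway file, change a single character, and checksum it again (demo.txt is just an example name, not part of the backup set):
$ echo "happy foal" > demo.txt
$ md5sum demo.txt
$ echo "happy Foal" > demo.txt
$ md5sum demo.txt
The second md5sum call prints a completely different checksum, even though only one character changed.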
ZFS [2], in particular, can verify if a data block is correct upon reading it. If it is not (e.g., as a result of a hard drive defect), ZFS either repairs the data block or throws an error for the user to see.
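ZFS checks blocks as it reads them, but you can also ask it to verify every block in a pool on demand with a scrub; the pool name tank below is just an example:
$ sudo zpool scrub tank
$ zpool status tank
The status command reports the scrub's progress and any checksum errors that were found or repaired.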
However, ZFS cannot protect data against human error: If you delete a file by accident with
rm Foals/01_foal.jpg
ZFS has no way of knowing this is a mistake instead of a legitimate operation. If a bogus image editor accidentally damages a picture using valid system calls, ZFS cannot differentiate changes caused by software bugs from changes intended by the user. While ZFS is often praised as the ultimate guarantee of data integrity, in my opinion its impressive capabilities fall short of that promise.
Protection from Userspace
To verify that the data being backed up is correct, I suggest relying on userspace utilities. While many userspace programs are superb at locating damaged files, not all of them are easily executable from an arbitrary recovery environment. In a system crash scenario, you may find yourself using something like an obsolete SystemRescue DVD (perhaps from an old Linux Magazine) instead of your normal platform. In keeping with the KISS principle, you should choose userspace tools that are portable and easy to use from any platform.
If your distribution includes the GNU coreutils package (which the vast majority do), you need no fancy tooling.
Ideally, you should verify the files' integrity immediately before the backup is performed. The simplest way of ensuring that a given file has not been modified, accidentally or otherwise, is to calculate its checksum and compare the result with the checksum it produced in a known good state (Figure 2). Thus, the first step towards protecting a given folder against corruption is to calculate the checksum of every file in the folder:
$ cd Foals
$ find . -type f ! -name '*.md5' -print0 | xargs -0 md5sum | sort -k 2 > md5sums_`date -I`.md5
(See the "Creating a Checksum" box for a more detailed explanation.)
Creating a Checksum
Calculating a checksum is not intuitive, so I will break down the command and explain how it works its magic.
The find command locates any file (but not directories) in the current folder, excluding files with the .md5 extension. It prints a list of the found files to the standard output. The path of each file is null terminated in order to avoid security issues (which could be derived from piping paths with special characters into the next command):
find . -type f ! -name '*.md5' -print0
Then xargs just accepts the list provided by the find command and passes it to the md5sum program, which generates a checksum for every entry in the list. The -0 switch tells xargs that find is passing null-terminated paths to it:
xargs -0 md5sum
The sort command orders the list (because find is not guaranteed to deliver sorted results). The output of md5sum has two columns: The second column contains the path of each file; the first contains its corresponding checksum. Therefore, I pass the -k 2 switch to sort in order to sort the list using the path names as the criterion:
sort -k 2
These commands create a list of all the files in the Foals directory, alongside their md5 checksums, and place it under Foals. The file will have a name dependent on the current date (such as md5sums_2022-01-23.md5).
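MD5 is perfectly adequate for catching accidental corruption, but if you prefer a stronger hash, sha256sum from the same coreutils package works as a drop-in replacement:
$ find . -type f ! -name '*.sha256' -print0 | xargs -0 sha256sum | sort -k 2 > sha256sums_`date -I`.sha256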
If a week later I want to verify that the files are fine, I can issue the same command to generate a new list. Then, it would be easy to check the differences between the state of the Foals folder on the previous date and the state of the Foals folder on the current date with the following command:
$ diff md5sums_2022-01-23.md5 md5sums_2022-01-30.md5
The diff command generates a list of differences between the two files, which will make it easy to spot which files have been changed, added, or removed from Foals (Figure 3). If a file has been damaged, this command will expose the difference.
Using diff is only practical if the dataset is small. If you are backing up a large number of files, there are better ways to check that your data is not corrupted. For instance, you can use grep to list the entries that exist in the old checksum file but not in the new one. In other words, grep will list the files that have been modified or removed since the last time you performed a check:
$ grep -Fvf md5sums_2022-01-30.md5 md5sums_2022-01-23.md5
The -f md5sums_2022-01-30.md5 option instructs grep to treat every line of md5sums_2022-01-30.md5 as a target pattern. Any line in md5sums_2022-01-23.md5 that coincides with any of these patterns will be regarded as a match. The -F option forces grep to consider patterns as fixed strings, instead of as regular expressions. Therefore, for a match to be registered, it must be exact. Finally, -v inverts the matching: Only lines from md5sums_2022-01-23.md5 that match no pattern will be printed.
You can also list the files that have been added since the check was last run with the shell magic in Listing 1.
Listing 1
Newly Added Files
awk '{print $2}' < md5sums_2022-01-30.md5 | while read -r file; do if ! grep -qF "$file" md5sums_2022-01-23.md5; then echo "$file is new."; fi; done
With these tools, an integrity verification policy falls into place. In order to ensure you don't populate your backups with corrupted files, you must do the following (a sample script tying these steps together appears after the list):
- Generate a list of the files in the dataset and their checksums before initiating the backup.
- Verify this list against the list you generated at the last known good state.
- Identify which changes have happened between the last known good state and the current state, and check if they suggest data corruption.
- If the data is good, back up your files.
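As an illustration, the following script sketches how these steps could be chained together for the Foals example. The script name, the interactive confirmation, and the /mnt destination are my own assumptions; adapt them to your setup.
#!/bin/sh
# check_and_backup.sh: sketch of a pre-backup integrity check (hypothetical)
set -e

cd Foals

# 1. Generate today's checksum list.
new="md5sums_`date -I`.md5"
find . -type f ! -name '*.md5' -print0 | xargs -0 md5sum | sort -k 2 > "$new"

# 2. Compare it against the most recent previous list, if one exists.
old=`ls md5sums_*.md5 | sort | tail -n 2 | head -n 1`
if [ "$old" != "$new" ]; then
    echo "Changes since $old:"
    diff "$old" "$new" || true
fi

# 3. Let a human decide whether the changes look legitimate.
printf "Proceed with backup? [y/N] "
read -r answer
[ "$answer" = "y" ] || exit 1

# 4. Back up the folder, checksum files included, into a dated snapshot.
cd ..
mkdir -p "/mnt/Foals_`date -I`"
rsync -a Foals/ "/mnt/Foals_`date -I`/"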
A great advantage of this method is that the checksum files can be used to verify the integrity of the backups themselves. For example, if you dumped the backup to /mnt/Foals_2022-01-23, you could just use a command such as:
$ cd /mnt/Foals_2022-01-23
$ md5sum --quiet -c md5sums_2022-01-23.md5
If any file was missing from the backup or had been modified, this command would reveal the issue right away.