Metadata in ODF Files
It is no secret that the native file format of LibreOffice and OpenOffice, the OpenDocument Format (ODF), is a truly open standard for word processing documents, spreadsheets, and presentations. What most people do not know is that ODF files contain lots of metadata that is very easy to read or modify.
Metadata means "data about data." The text messages you exchange using your phone, for example, are a form of data. The people with whom you exchange those messages, when, how often, from where, and so on are metadata about your messaging habits and connections.
Metadata is really important. I once heard French philosopher Bernard Stiegler observe that "the production of metadata has been the principal activity of those in power from the time of the proto-historical empires right up to today."
On a less philosophical and more practical level, lots of metadata is stored in your office documents, and you'll find many valid reasons for messing with the metadata in office files. This tutorial describes the most common of those reasons and offers a general approach to reading and writing metadata in ODF files. The approach is easy and extensible because an ODF file is just a standard ZIP archive of different kinds of plain text or image files.
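To make the "just a ZIP archive" point concrete, here is a minimal sketch that opens an ODF file with Python's standard zipfile module and lists what is inside. Python is my stand-in here, not the shell tools this tutorial relies on, and the function names list_odf_members and odf_mimetype are hypothetical. The sketch assumes the archive contains the usual mimetype member that identifies the document type.

```python
import zipfile


def list_odf_members(path):
    """Return the names of all files stored inside the ODF archive at `path`."""
    with zipfile.ZipFile(path) as odf:
        return odf.namelist()


def odf_mimetype(path):
    """Return the document type recorded in the archive's `mimetype` member."""
    with zipfile.ZipFile(path) as odf:
        return odf.read("mimetype").decode("ascii").strip()
```

Calling list_odf_members("report.odt") on a typical text document should return names such as mimetype, meta.xml, and content.xml, plus any embedded pictures.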
Why Read and Write ODF Metadata?
Analyzing ODF metadata can help you work better and sometimes learn more about your organization than you thought possible. Editing the same metadata means controlling what everybody else knows about you. Together, these two procedures help to identify and fix many problems, from privacy and security to compliance and indexing. You may, among other things, automatically find, report, and "fix" (see below) ODF files that contain:
- Dangerous, obsolete, or redundant macros
- Information not compliant with your company policies
- Images containing location, author name, or other sensitive information
The raw metadata in ODF files can also be aggregated to create statistics, graphs, or reports about whole collections of documents, or fed into an external database. Numeric data that can be averaged ranges from word counts to the number and overall duration of edits to each document. This, in turn, may facilitate both simple decisions ("which documents should be updated first?") and more complex ones ("is our team working in the most efficient way?").
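As an illustration of that kind of aggregation, the following sketch averages the word counts of a set of documents. It assumes each file stores its statistics in the meta:word-count attribute of a meta:document-statistic element, which is what LibreOffice writes for text documents. Files without that attribute are simply skipped. The function names are hypothetical, and Python's standard library stands in for the shell tools.

```python
import statistics
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by ODF for its own metadata elements and attributes.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def word_count(path):
    """Return the word count recorded in the file's metadata, or None if absent."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("meta.xml"))
    stat = root.find(f".//{{{META_NS}}}document-statistic")
    if stat is None:
        return None
    value = stat.get(f"{{{META_NS}}}word-count")
    return int(value) if value is not None else None


def average_word_count(paths):
    """Average the recorded word counts, ignoring files that have none."""
    counts = [c for c in map(word_count, paths) if c is not None]
    return statistics.mean(counts) if counts else None
```

The same pattern should extend to other numeric attributes, such as page or paragraph counts, by changing the attribute name.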
On the editing side, you may do the following, for example:
- Normalize and complete metadata (e.g., insert missing author names or titles, all with the same spelling, or change company or department names after a reorganization)
- Hide sensitive data (e.g., remove authors or comments inserted for internal use before sharing documents online, as an ODF, or even as a PDF)
- Add or update disclaimers for compliance with new regulations or company rules
- Add custom properties for better indexing
- Give files names that match the title of the document (or vice versa)
- Insert watermarks into pictures
- Remove metadata from inside pictures
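The first two items in this list boil down to the same operation: unpack the archive, edit one text file, and pack everything again. Here is a minimal sketch of that round trip, which overwrites the author fields with a single normalized name. It works by plain text substitution, much like sed would, and assumes the metadata file uses the dc: and meta: prefixes that LibreOffice writes. The function name set_author is hypothetical, and Python's zipfile module stands in for the shell tools.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# Matches the initial author and the last-modified-by author elements.
AUTHOR_PATTERN = re.compile(
    r"(<(meta:initial-creator|dc:creator)>).*?(</\2>)", re.DOTALL
)


def set_author(src, dst, new_name):
    """Copy the ODF file `src` to `dst`, replacing every author name in meta.xml."""
    safe_name = escape(new_name)  # keep the XML well-formed if the name has & or <
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == "meta.xml":
                text = data.decode("utf-8")
                text = AUTHOR_PATTERN.sub(
                    lambda m: m.group(1) + safe_name + m.group(3), text
                )
                data = text.encode("utf-8")
            # Reusing `info` keeps each member's name, order, and compression.
            zout.writestr(info, data)
```

Writing to a separate destination file rather than editing in place leaves the original untouched if anything goes wrong halfway through.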
Methodology and Scope
In this tutorial, I introduce a relatively simple way to read or write ODF metadata that works even on systems where neither LibreOffice nor OpenOffice is installed, including systems running Windows or macOS. All you need is support for shell scripts and a few other command-line utilities like grep, sed, exiftool, and ImageMagick, all of which are included, or installable as binary packages, on almost every Linux distribution. Besides, the ODF metadata processing approach described here can be useful in many other text-processing contexts.
When I say "introduce" or "approach," I mean that, while I provide working code, it is not a complete solution, but rather a collection of examples to use as inspiration and as building blocks for your own ODF metadata problems. One reason for this is that the mere printing of a script that could handle all possible cases with optimal performance would be longer than this whole article.
The other, more important reason is that almost nobody would need such a solution or "top" performance. ODF metadata hacks can save you many days of work, if not many weeks. They did for me. However, unless you really have to process thousands of files every day, you (like me) will only use these hacks in two ways:
- A few times a year, maybe in a different way every time
- Regularly, once per day or less, but as jobs that can run slowly in the background only on the files that have changed since the previous run
In scenarios like these, it is more efficient to put some code together quickly that just works, instead of optimizing it to death. What matters is knowing how to put that code together when the need suddenly arises.
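For the second scenario, a background job only needs to know which files changed since it last ran. One simple way to do that, sketched below, is to compare each file's modification time against that of a timestamp file that the job touches when it finishes. The function names, the timestamp file, and the list of extensions are all assumptions of mine. Again, Python's standard library stands in for the shell tools.

```python
from pathlib import Path

# Extensions treated as ODF documents in this sketch; extend as needed.
ODF_SUFFIXES = {".odt", ".ods", ".odp", ".odg"}


def changed_since(root, stamp_file):
    """Return ODF files under `root` modified after `stamp_file` was last touched.

    If the timestamp file does not exist yet, every ODF file is returned.
    """
    stamp = Path(stamp_file)
    last_run = stamp.stat().st_mtime if stamp.exists() else 0.0
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() in ODF_SUFFIXES
        and p.stat().st_mtime > last_run
    )


def mark_run(stamp_file):
    """Record that a run has just completed by updating the timestamp file."""
    Path(stamp_file).touch()
```

A daily job would call changed_since, process the files it returns, and then call mark_run so that the next run skips them.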
ODF Metadata
Mainly, there are two types of metadata in ODF files. The first consists of the data that you may read or set in the LibreOffice File | Properties tabs shown in Figures 1 to 4. Some of those variables are present in every ODF file, others only in certain types, but they are all saved in a file called meta.xml inside the ODF ZIP archive.
In addition to this, so to speak, "official" metadata, there is what I would call "hidden" metadata: metadata in, or about, the non-textual content of an ODF document, which is mainly macros and images. I will now show you how to read, and then write, both types of ODF metadata.
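As a first taste of both types, here is a minimal sketch that reads a few "official" properties from the metadata file and then lists the archive members where "hidden" metadata is likely to live. It assumes the layout that LibreOffice normally produces: images under Pictures/ and macros under Basic/ or Scripts/. The function names and the choice of fields are hypothetical, and Python's standard library stands in for the shell tools.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespaces used for the properties read below.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# Friendly name -> element holding that property in the metadata file.
FIELDS = {
    "title": "dc:title",
    "initial_creator": "meta:initial-creator",
    "creator": "dc:creator",
    "date": "dc:date",
}


def official_metadata(path):
    """Return selected document properties, with None for any that are missing."""
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("meta.xml"))
    result = {}
    for key, tag in FIELDS.items():
        node = root.find(f".//{tag}", NS)
        result[key] = node.text if node is not None else None
    return result


def hidden_content(path):
    """Return the archive members that hold images and macros (assumed layout)."""
    with zipfile.ZipFile(path) as odf:
        names = [n for n in odf.namelist() if not n.endswith("/")]
    return {
        "images": [n for n in names if n.startswith("Pictures/")],
        "macros": [n for n in names if n.startswith(("Basic/", "Scripts/"))],
    }
```

The image members found this way could then be extracted and handed to a tool such as exiftool for a closer look at what they carry.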