Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Extending Capabilities This Way and That
Eric W. Biederman wanted to find a good way to give one process a subset of the capabilities available to another process. Capabilities are a finer-grained set of permissions than regular ownership and read/write/execute permissions. They were defined in a POSIX draft that never became official, so the true nature and behavior of Unix capabilities has no official standard. It makes for craziness.
Eric's idea was to define a couple of new system calls, called fsyscall() and fsyscall_create(). The main call, fsyscall(), would take another system call number as one parameter and a set of capabilities as another; then, it would run the targeted system call using the given set of capabilities, rather than those of the calling process.
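The shape of that proposal can be modeled in a few lines of user-space Python. This is only a toy sketch: fsyscall() was a proposal, the role given to fsyscall_create() here (producing a handle that carries a reduced capability set) is an assumption, and the call table, call numbers, and the CapHandle class are all hypothetical.

```python
from dataclasses import dataclass

# Two real capability names, used here purely as labels.
CAP_CHOWN = "CAP_CHOWN"
CAP_SYS_TIME = "CAP_SYS_TIME"


def _toy_settime(value):
    return f"clock set to {value}"


def _toy_chown(path, uid):
    return f"{path} now owned by {uid}"


# Hypothetical call table: number -> (capability required, implementation).
# The numbers are made up and do not match any real ABI.
TOY_SYSCALLS = {
    1: (CAP_SYS_TIME, _toy_settime),
    2: (CAP_CHOWN, _toy_chown),
}


@dataclass(frozen=True)
class CapHandle:
    """Stands in for an object that could be passed between processes."""
    caps: frozenset


def fsyscall_create(caller_caps, wanted):
    """Assumed behavior: hand out a subset of the caller's capabilities."""
    wanted = frozenset(wanted)
    if not wanted <= frozenset(caller_caps):
        raise PermissionError("cannot grant capabilities the caller lacks")
    return CapHandle(wanted)


def fsyscall(handle, nr, *args):
    """Run call number nr using the handle's capabilities, not the caller's."""
    required, impl = TOY_SYSCALLS[nr]
    if required not in handle.caps:
        raise PermissionError(f"handle lacks {required}")
    return impl(*args)
```

A privileged process could create a handle holding only CAP_CHOWN and pass it to a helper; the helper could then change file ownership through the handle but could not set the clock.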
He asked Andy Lutomirski's opinion, but Andy pointed out that system call numbers were not as clean as Eric thought. Specifically, not all system calls were directly callable by number using the ABI (application binary interface). Andy said he was actually working on the problem from almost the opposite angle, to "have a way that one task can trap into a special mode in which another process can do syscalls on its behalf."
Eric said he thought Andy's feature might be vulnerable to attack, but they continued talking about Eric's approach. Eric clarified the question he was really trying to answer with his patch: "How do we create an object that we can pass between processes that limits what we can do in the case of the oddball syscalls that require special privileges."
In another email, he clarified:
"Where I am focusing is turning Posix capabilities into real capabilities. I would not mind if the functionality was a bit more general. Say to be able to handle things like security labels, or anywhere else you might reasonably be asked can you do X?
"But I would be happy if we just managed to wrap the Posix capabilities and turned them into real capabilities."
The distinction between POSIX capabilities and "real" capabilities is one that's been part of the capability conundrum since the beginning. POSIX capabilities divvy up root privileges into discrete abilities that can be granted or refused individually. "Real" capabilities is a computer science idea that predates POSIX capabilities. It refers to the general idea of giving a process a token to indicate that it has permission to perform a particular action on a particular object.
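An open file descriptor is often cited as the everyday Unix object that behaves most like a "real" capability: whoever holds it can act on one particular object, with no further lookup by name. The sketch below illustrates that token-like behavior on a POSIX system; the function names are hypothetical and the example is an illustration of the general idea, not of anything in Eric's patch.

```python
import os
import tempfile


def read_via_token(fd, nbytes=64):
    """Read using only the descriptor; no path lookup is involved."""
    os.lseek(fd, 0, os.SEEK_SET)
    return os.read(fd, nbytes)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "secret.txt")
        with open(path, "wb") as f:
            f.write(b"hello")
        fd = os.open(path, os.O_RDONLY)  # acquire the token
        try:
            os.unlink(path)              # the name is gone...
            return read_via_token(fd)    # ...but the token still grants access
        finally:
            os.close(fd)
```

The descriptor names both the object and the permitted action (read-only access to that one file), which is the property POSIX capabilities lack: CAP_CHOWN, for instance, applies to every file on the system rather than to a specific one.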
There was not much more discussion, but it looks like Eric and Andy are trying to address different use cases. The whole issue of capabilities has many thorns. For one thing, there's always the danger that a new feature or refactoring effort might open up security holes, and plugging those holes can tend to make features look convoluted or arbitrary. Then, there is the whole lack of a clear standard, which at the very least means that the version of capabilities found in the abandoned POSIX draft may not be a great authority to rely on.
So Eric, Andy, and anyone else working in this area are truly defining a frontier of Linux development, and the contours of the final outcome just can't be seen at all at this point.
Plugging Security Weaknesses Built into Old Compatibility Features
Andy Lutomirski pointed out that "Linux has a handful of weird features that are only supported for backwards compatibility. The big one is the x86_64 vsyscall page, but uselib probably belongs on the list, too, and we might end up with more at some point."
He wanted to figure out a way for this dumping ground of features to become invisible to sufficiently new software. The features existed only to support old code, and new code should really never use them at all; so why even see them?
Of course, Andy's idea couldn't require old software to do anything specific to enable those compatibility features, because the whole idea is that old software is old and isn't being updated anymore. So, as he put it, "I'd like to add a way that new programs can turn these features off." He added, "I want the vsyscall page to be completely gone from the perspective of any new enough program. This is straightforward if we add a system call to ask for the vsyscall page to be disabled, but I'm wondering if we can come up with a non-syscall way to do it."
Or, as an even better possibility, he said that "the ideal behavior would be that anything linked against a sufficiently new libc would be detected, but I don't see a good way to do that using existing toolchain features."
Brian Gerst poured cold water on the idea, saying, "The vsyscall page is mapped in the fixmap region, which is shared between all processes. You can't turn it off for an individual process."
But Andy disagreed. He pointed out that while vsyscall might be universally shared, "We already emulate all attempts to execute it, and that's trivial to turn off per process."
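On an x86_64 system with vsyscall enabled, the page shows up as a [vsyscall] entry at a fixed address in /proc/<pid>/maps, which makes it easy to check what a given process sees. A minimal sketch follows; the helper names are hypothetical, and the permission string reported depends on how the kernel was configured.

```python
def find_vsyscall(maps_text):
    """Return (start, end, perms) for the [vsyscall] mapping, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vsyscall]":
            start, end = (int(x, 16) for x in fields[0].split("-"))
            return start, end, fields[1]
    return None


def current_process_vsyscall():
    """Check the running process; returns None if /proc is unavailable."""
    try:
        with open("/proc/self/maps") as f:
            return find_vsyscall(f.read())
    except OSError:
        return None
```

If the result is None on a system where /proc is readable, the page is not visible to the process at all, which is the end state Andy wanted for new programs.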
Rich Felker asked if there was any actual real-world problem Andy was trying to solve with this. He pointed out that "the vsyscall nonsense is fully emulated now and that the ways it could be used as an attack vector have been mitigated." But Andy pointed out that Google's Project Zero was still finding exploits [1]. Moreover, he saw the potential for other exploits as well. He wanted to nail vsyscall down for good.
Rich suggested a new possibility – just have no mapping for vsyscall at all. If any program tried to access it, there would be a page fault. The kernel could then catch the fault and emulate vsyscall as needed. Presto. No need for compile-time detection, or special calls by new software, or anything like that.
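Rich's fault-and-emulate idea can be pictured with a toy model. The sketch below is a user-space simulation only, not kernel code: the "address space" is a dictionary, the PageFault exception and the function names are hypothetical, and only the legacy time() entry point (at offset 0x400 in the vsyscall page) is emulated.

```python
import time

VSYSCALL_BASE = 0xFFFFFFFFFF600000
VSYSCALL_TIME = VSYSCALL_BASE + 0x400  # legacy time() entry point


class PageFault(Exception):
    def __init__(self, addr):
        super().__init__(hex(addr))
        self.addr = addr


# Toy address space: address -> callable. The vsyscall page is deliberately absent.
MAPPED = {0x401000: lambda: "ordinary code ran"}

# What the fault handler knows how to emulate.
EMULATED = {VSYSCALL_TIME: lambda: int(time.time())}


def call(addr):
    """Jump to addr; an unmapped address faults."""
    fn = MAPPED.get(addr)
    if fn is None:
        raise PageFault(addr)
    return fn()


def run(addr):
    """Catch the fault and emulate legacy vsyscall entry points."""
    try:
        return call(addr)
    except PageFault as fault:
        handler = EMULATED.get(fault.addr)
        if handler is None:
            raise  # a genuine bad access; a real process would get SIGSEGV
        return handler()
```

In this model an old binary that jumps to the time() entry still gets an answer, while nothing is actually mapped at the address for an attacker to read or jump into.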
Andy objected to this, saying that some modern dynamic instrumentation tools needed to be able to read the targets of calls and that Rich's idea would break compatibility for those programs.
Rich was incredulous. He said, "do people seriously need to do this dynamic instrumentation on ancient obsolete binaries? This sounds to me like confused requirements." Andy replied that he had received at least one bug report from someone doing exactly that a couple of years back. Still, he acknowledged that, "as long as we have a way to distinguish new and old binaries, it's not that much harder to twiddle vsyscall readability per process than it is to twiddle vsyscall executability per process."
But Rich still didn't like this idea. He said, "But we don't have a (reasonable) way to distinguish new and old binaries, at least not at the right point in history. If we're adding a new header or whatnot, only bleeding-edge binaries will benefit from it. All existing binaries from the past N years that don't need the vsyscall nonsense will still get it unnecessarily, and still be subject to the risks."
Rich also pointed out:
Since the only calls to vsyscall are from glibc, it seems to me that the only ways vsyscall can be needed are:
1. The user is running old glibc, in which case all dynamic-linked programs need it.
2. The user is running old static-linked glibc binaries. Almost nobody does this. During the era of vsyscall, static linking was all but deprecated.
3. The user is running old binaries using a custom library path with old glibc in it. This is almost certainly just a bogus setup since glibc's symbol versioning is supposed to make old binaries run fine with a newer libc.so.
None of these seem to be use cases that we should be engineering complex solutions for. For case 1, the solution wouldn't help anyway since all programs need vsyscall. For cases 2 and 3, if the user wants to harden their system so that newer binaries are not affected by vsyscall, they should just remove vsyscall and fix their old binaries/libraries. In case 2, in particular, you can assume the ability to re-link with an updated glibc; otherwise, there's an LGPL violation going on.
The discussion petered out at this point. But it's clear that Andy's main goal is not to clean up old interfaces arbitrarily, so much as to eliminate potential security holes. So on that level, I think there'll eventually be a way to disable things like vsyscall for modern software. The exact mechanism to accomplish that doesn't seem clearly known yet, but as Rich pointed out in a different email, the best approach may be to just let modern systems disable vsyscall at kernel compile time if they don't need legacy software. Eventually vsyscall would age out that way, and there'd be no more problem.
Deprecating the Linux Framebuffer
Tomi Valkeinen pointed out that although fbdev (the Linux graphical framebuffer) was still maintained, it had been deprecated in favor of the DRM (Direct Rendering Manager) subsystem. Because of that, Tomi asked developers to stop submitting new fbdev drivers and to work through DRM instead.
Tomi said he'd continue to maintain the current set of fbdev drivers and would even add new features to those drivers if the requests were not extravagant. However, the three new fbdev drivers (xgifb, fbtft, and sm750fb) that were currently in staging didn't have much future, in his opinion. Tomi said they either looked old, required custom APIs, or were intended to provide another layer of abstraction on top of fbdev. Not going to happen. He suggested simply removing them from the staging area and leaving them out of the kernel entirely.
Thomas Petazzoni pointed out that for very simple hardware devices, it was still overly complicated to write good DRM drivers. He pointed out that "fbtft mainly drives some very simple I2C-based or SPI-based displays, and DRM is I believe overkill for such displays. Last time I talked with Laurent Pinchart about such drivers, I believe he said that such simple drivers could probably continue to use the fbdev subsystem."
But Daniel Vetter replied, "drm already has piles of really simple drivers with just one output and one framebuffer. There's no reason not to use drm for gfx drivers at all."
There were some exclamations of surprise over that point, but folks accepted it. And Ondrej Zary asked, "Is there a simple way to convert existing fbdev drivers to DRM? Let's say I want to convert tridentfb to DRM, keeping the 2D acceleration (pan, fillrect, copyarea, imageblit) to be usable by the console (and maybe extend it to X11 using some generic 2D driver?)"
Daniel replied:
DRM doesn't do generic 2d accel, it's all driver specific. And consensus for 2d accel (at least in X) is pretty much that if you have a 3d gpu use glamour. If you don't have that then use the cpu. There's a hint for drm userspace whether to use shadowfb for cpu rendering or not.
What you can do though if you want is keep your accel code for the fbdev emulation on top of the drm modesetting driver, there's a few oddball drivers who do that. And panning is of course already supported by the modeset api.
Meanwhile, Aaro Koskinen said that he was still planning to do work on xgifb because he needed it for console support on some of his systems. But Ondrej pointed out that half of the devices supported by xgifb were already supported by the DRM driver sisfb, so it made more sense to port the others over to sisfb as well, rather than keep xgifb afloat on a deprecated subsystem.
Elsewhere, Geert Uytterhoeven asked Daniel to give a list of the "really simple" DRM drivers. Alex Deucher listed the tilcdc, ast, mgag200, and udl drivers. But Geert pointed out that the smallest of those was still 2,800 lines of code, whereas some fbdev drivers clocked in at 200 lines of code (LoC).
David Herrmann felt that counting lines of code was not really the key issue, but Geert replied, "LoC is not the most important. But if the smallest DRM driver needs an order of magnitude more LoC than the smallest fbdev driver, I start to wonder. E.g. if I want to write a new simple driver for my new shiny hardware, it can make a big difference if I have to write (and test/debug) 800 LoC, or 3000 LoC."
Geert added, "from the figures above, I don't think we're at that point yet that writing a new DRM driver is less/equal amount of work than writing a new fbdev driver, at least for some classes of hardware. So it may be a bit premature to put a moratorium on new fbdev drivers."
The discussion went on for a bit, but the upshot is that fbdev is still deprecated, and folks are motivated to make DRM easier to use for simpler hardware. It's a nice discussion to watch, because everyone seemed to have a fair sense of balance and of the need to continue to support features that were actually needed in the wild, while at the same time allowing older code to age gracefully out of the kernel.