Zack's Kernel News
Chronicler Zack Brown reports on the little links that bring us closer within the Linux kernel community.
The Hot Mess of Closed Source
In the course of trying to track down a regression, Akihiro Suda traced the problem to a couple of patches that had been accepted into a recent kernel release. A regression is when something that used to work stops working, and the developers have to look back through the patch history to see which change caused the breakage. Finding that change is what the git bisect command is for. It starts from a known good version and a known bad version, tests the version in the middle, and then keeps halving the remaining range until it finds the bad patch that started it all. Git makes regressions fun.
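The halving idea behind bisection can be shown in a few lines of Python. This is only a sketch of the search strategy, not how Git itself is implemented: it assumes a strictly linear history in which every commit after the first bad one is also bad, and the commit names and the `is_bad` test are made up for illustration.

```python
from typing import Callable, List, Sequence


def bisect_first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    once a commit is bad, every later commit is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return commits[hi]


# Hypothetical history: c0 is good, and c5 introduced the regression.
history = [f"c{i}" for i in range(8)]
tested: List[str] = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # record which commits had to be built and tested
    return int(commit[1:]) >= 5


first_bad = bisect_first_bad(history, is_bad)
print(first_bad, tested)
```

With eight commits, only three of them need to be built and tested, which is why bisection stays practical even across thousands of patches.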
However, this particular regression had to do with running virtualized systems and related to both the Advanced Configuration and Power Interface Component Architecture (ACPICA), which is for discovering and configuring the hardware on a given system, and EFISTUB, which lets the Unified Extensible Firmware Interface (UEFI) load the Linux kernel as an EFI application.
The keyword is "firmware." Generally this is closed source software associated with a specific piece of hardware, without which the hardware won't run at all. Linux tolerates it because it has no choice, but as with the Basic Input/Output System (BIOS), firmware is almost always a broken, buggy pain in the butt. Sometimes developers will reverse engineer the firmware and write their own open source version, but generally the closed source hot mess is what we get.
For example, when Akihiro reported the kernel regression, Ard Biesheuvel asked if he'd been using Open Virtual Machine Firmware (OVMF), an open source UEFI firmware designed for booting virtual machines, most commonly under the QEMU open source virtualization system.
However, Akihiro replied that he wasn't using Qemu. He was using Apple's closed source virtualization framework, which also didn't use UEFI to load the kernel.
Akihiro said, "Despite that, it still expects LINUX_EFISTUB_MINOR_VERSION (include/linux/pe.h) referred from arch/x86/boot/header.S to be 0x0. I confirmed that the kernel can boot by just setting LINUX_EFISTUB_MINOR_VERSION to 0x0."
Akihiro said he would ask Apple to remove that particular check, as it seemed pointless. But he also asked the Linux developers, "Would it be possible to revert the LINUX_EFISTUB_MINOR_VERSION value (not the actual code) to 0x0? Or will it break something else?"
Ard felt that approaching Apple was probably the best move. He said, "If the existing virtual machine BIOS has a hardcoded check that the EFI stub version is 1.0 even if it does not boot via EFI to begin with, I don't see how we can reasonably treat this as a regression that needs fixing on the Linux side."
Ard also pointed out that there could be a significant cost to changing the Linux EFISTUB minor version requirement. He said, "the version bump to PE image version v1.1 sets a baseline across all Linux architectures that can boot via EFI that initrd loading is supported via the command line as well as via the LoadFile2 protocol. Reverting that would substantially reduce the value of having this identification embedded into the image."
It was at this point that Linus Torvalds came into the conversation. He explained:
"Well, we consider firmware issues to be the same as any hardware issue. If firmware has a bug that requires us to do things certain ways, that's really no different from hardware that requires some insane init sequence.
"So why not just say that LINUX_EFISTUB_MINOR_VERSION should be 0, and just add the comment that versioning doesn't work?
"I'm not sure why this was tied into always enabling the initrd command line loader.
"Numbered version checks are a fundamentally broken and stupid concept anyway. Don't do them. Just leave it at zero, and maybe some day there is a sane model that actually has a bitfield of capabilities and requirements."
Akihiro remarked, "Looks like Apple's vmlinuz loader only requires LINUX_EFISTUB_MINOR_VERSION to be 0x0 and does not care about LINUX_EFISTUB_MAJOR_VERSION."
And the thread ended.
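To make the disagreement concrete, here is a small Python sketch contrasting the two styles of check. Everything in it is hypothetical: `strict_loader_accepts` only mimics the behavior Akihiro described (minor version must be 0, major version ignored), and the `Cap` bits are invented to illustrate the kind of capability bitfield Linus alluded to. Nothing like them exists in the kernel or in the PE header today.

```python
from enum import IntFlag


def strict_loader_accepts(major: int, minor: int) -> bool:
    """Hypothetical loader modeled on Akihiro's report.

    It insists on minor version 0 and does not look at the major version.
    """
    return minor == 0


class Cap(IntFlag):
    """Made-up capability bits, for illustration only."""
    INITRD_CMDLINE = 1 << 0    # initrd can be passed on the command line
    INITRD_LOADFILE2 = 1 << 1  # initrd can be loaded via LoadFile2


def capability_loader_accepts(image_caps: Cap, required: Cap) -> bool:
    """Accept an image if it offers every capability the loader needs."""
    return (image_caps & required) == required


# A numbered check breaks as soon as the image advertises something newer,
# even when the loader never uses the new feature.
print(strict_loader_accepts(1, 0))  # the old v1.0 image
print(strict_loader_accepts(1, 1))  # the v1.1 image that triggered the report

# With capability bits, a loader that needs nothing extra keeps working
# when the image gains new features.
new_image = Cap.INITRD_CMDLINE | Cap.INITRD_LOADFILE2
print(capability_loader_accepts(new_image, Cap(0)))
print(capability_loader_accepts(Cap.INITRD_CMDLINE, Cap.INITRD_LOADFILE2))
```

The point of the bitfield model is that a loader states what it actually needs, so an image that grows new features does not trip a check that was never about those features.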
Accommodating busted hardware and firmware is not new at all. It's fascinating to imagine how much thoroughly broken hardware is accommodated in the weirdest possible ways by the Linux kernel code, often resulting in poorer performance and avoiding whole swaths of high-powered hardware features that themselves represented big security holes. Some of these hardware bugs even get their own names, such as Spectre, Meltdown, Foreshadow, ZombieLoad, MDS, LazyFP, PortSmash, TLBleed, Plundervolt, CacheOut, and the list goes on.
TV Is the Thing This Year, This Year
Doug Berger wanted to eke out every last drop of RAM efficiency from the Linux kernel, especially when running Broadcom System-on-a-Chip (SoC) hardware such as BCM7445 and BCM7278, built for TVs. These systems and others like them, he said, "contain multiple memory controllers with each mapped in a different address range within a Uniform Memory Architecture." Uniform memory access (UMA), where every processor sees all of the RAM at roughly the same cost, is typical of smaller systems that aren't expected to handle heavy workloads. It stands in contrast with non-uniform memory access (NUMA), where each processor has its own local memory that it can reach faster than memory attached to other processors. Large multi-socket servers tend to be NUMA, while systems that process nothing but TV signals tend to be UMA.
UMA is fundamentally a corner-cutting technology. Memory is made available to all processors, at the cost of the faster processors having to accept the slower RAM built into the slower processors. But there are still ways to speed up memory usage – for example, by grouping the memory allocations for individual threads together. That way there would be less need to jump around between disparately located memory regions.
Not all memory allocations can simply be moved around, however. Some low-level system code needs to stay in one place. To distinguish between memory allocations that can and can't be moved around by the kernel, memory can be labeled as ZONE_MOVABLE.
However, on these Broadcom UMA systems, it wasn't that easy. As Doug put it, "Unfortunately, the historical monotonic layout of zones would mean that if the lowest addressed memory controller contains ZONE_MOVABLE memory then all of the memory available from memory controllers at higher addresses must also be in the ZONE_MOVABLE zone."
This in turn, he went on, "would force all kernel memory accesses onto the lowest addressed memory controller and significantly reduce the amount of memory available for non-movable allocations." In other words, the kernel itself and everything it needed to do.
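Doug's complaint about the monotonic layout can be illustrated with a toy model in Python. The numbers are simplifying assumptions rather than the real BCM7278 memory map: two controllers with 1GB each at contiguous addresses, 256MB blocks, a made-up boundary at 512MB, and a single NON_MOVABLE label standing in for all of the kernel's non-movable zones.

```python
BLOCK_MB = 256

# (controller, start address in MB): controller 0 is low, controller 1 is high.
blocks = [(ctrl, ctrl * 1024 + i * BLOCK_MB) for ctrl in (0, 1) for i in range(4)]


def monotonic_layout(blocks, movable_start_mb):
    """Zones are ordered by address: everything at or above the boundary
    becomes ZONE_MOVABLE, and everything below it stays non-movable."""
    return {
        start: "MOVABLE" if start >= movable_start_mb else "NON_MOVABLE"
        for _ctrl, start in blocks
    }


def per_controller(blocks, layout):
    """Total up the megabytes of each kind of memory on each controller."""
    summary = {}
    for ctrl, start in blocks:
        totals = summary.setdefault(ctrl, {"NON_MOVABLE": 0, "MOVABLE": 0})
        totals[layout[start]] += BLOCK_MB
    return summary


# Put the movable boundary halfway up the low controller.
layout = monotonic_layout(blocks, movable_start_mb=512)
summary = per_controller(blocks, layout)
for ctrl, totals in summary.items():
    print(ctrl, totals)
```

In this model, asking for any movable memory on the low controller turns the whole high controller movable as well, leaving the kernel only 512MB for its own non-movable allocations, all of it behind a single controller.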
Doug posted a patch to create what he called "Designated Movable Blocks," which the kernel would use to satisfy requests for movable blocks of RAM.
One obvious alternative would be for the kernel to skip UMA entirely and treat Broadcom TVs and all other such devices as full NUMA systems. After all, they have multiple CPUs, so why not do the standard thing? But this gets back to why UMA exists in the first place – to make as much RAM as possible available to all CPUs on a system containing scarce resources. As Doug put it, "NUMA architectures support distributing movable core memory across each node, but it is undesirable to introduce the overhead and complexities of NUMA on systems that don't have a Non-Uniform Memory Architecture."
In response to Doug's patch, Mel Gorman replied with some objections. In particular, he wasn't convinced that the patch would actually improve memory usage on those Broadcom systems.
Mel had actually been one of the main people to implement ZONE_MOVABLE support in the kernel in the first place, and he had some serious regrets on that score. He said, "Zones are about addressing limitations primarily and frankly, ZONE_MOVABLE was a bad idea in retrospect." A better idea, he said, would have been to treat UMA as just a special case of NUMA. Specifically, "create a separate NUMA node with distance-1 to the local node [...] that was ZONE_MOVABLE with the zonelists structured such that GFP_MOVABLE allocations would prefer the 'movable' node first."
Mel lamented, "While I don't recall why I did not take that approach, it most likely was because CONFIG_NUMA was not always set, it was only intended for hugetlbfs allocations and maybe I didn't have the necessary skill or foresight to take that approach."
Mel lambasted, "A major limitation of ZONE_MOVABLE is that there is no way of controlling access from userspace to restrict the high-speed memory to a designated application, only to all applications in general. The primary interface to control access to memory with different characteristics is mempolicies which is NUMA orientated, not zone orientated. So, if there is a special application that requires exclusive access, it's very difficult to configure based on zones. Furthermore, page table pages mapping data located in the high-speed region are stored in the slower memory which potentially impacts the performance if the working set of the application exceeds TLB reach. Finally, while there is mention that Broadcom may have some special interface to determine what applications can use the high-speed region, it's hardware-specific as opposed to something that belongs in the core mm."
He offered more comments, but finally Mel seemed very clear on the fact that "The high bandwidth memory should be representated as a NUMA node, optionally to create that node as ZONE_MOVABLE and relying on the zonelists to select the movable zone as the first preference."
However, Doug replied, "It remains true that CONFIG_NUMA is not always set and that is a key motivator for this patch set. For example, Google is moving to a common GKI kernel for their Google TV platform that they are requiring vendors to support. Currently the arm64 GKI kernel does not set CONFIG_NUMA and it seems unlikely that we will be able to get all vendors to accept such a change."
Doug also remarked, "This patch set is fundamentally about greater control over the placement of movablecore memory. The current implementation of movablecore requires all of the ZONE_MOVABLE memory to be located at the highest physical addresses of the system when CONFIG_NUMA is not set. Allowing the specification of a base address allows greater flexibility on systems where there are benefits."
He added, "I don't believe this is really about trying to optimize the performance of a specific application as much as trying to prevent overall system performance degradation from underutilized memory bandwidth." Elsewhere he remarked, "the approach taken here is very much a 'poor man's' approach that attempts to improve things without requiring the 'heavy lifting' required for a more complete solution."
He further explained:
"What is of interest to Broadcom customers is to better distribute user space accesses across each memory controller to improve the bandwidth available to user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with 1GB of memory on each memory controller will place the 1GB on the low address memory controller in ZONE_DMA and the 1GB on the high address memory controller in ZONE_NORMAL. With this layout movable allocation requests will only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL (high memory controller) is sufficiently depleted of free memory.
"Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore behavior does not improve this situation other than forcing more kernel allocations off of the high memory controller. User space allocations are even more likely to be on the high memory controller.
"The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be located on the low memory controller to make it easier for user space allocations to land on the low memory controller. If ZONE_MOVABLE is only placed on the low memory controller then user space allocations can land in ZONE_NORMAL on the high memory controller, but only through fallback after ZONE_MOVABLE is sufficiently depleted of free memory which is just the reverse of the existing situation. The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be located on each memory controller so that user space allocations have equal access to each memory controller until the ZONE_MOVABLE memory is depleted and fallback to other zones occurs."
At a certain point, Mel did acknowledge, "Ok, I did misunderstand at the time that ZONE_MOVABLE would be split between the controllers to improve interleaving of user accesses."
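The fallback behavior Doug described can be simulated with a deliberately crude Python model. It is a sketch under several assumptions, not the kernel's page allocator: memory is counted in 1MB units, a movable request is served from the first zone in the list that has free memory, and within a zone the allocator is assumed to pick the controller with the most free memory. The 256MB of ZONE_MOVABLE per controller is a made-up size.

```python
def allocate(zonelist, n_units):
    """Serve n_units movable allocations; return units used per controller.

    zonelist is an ordered list of (zone name, {controller: free units}).
    Earlier zones are preferred; later ones are only used as fallback.
    """
    used = {}
    for _ in range(n_units):
        for _name, free in zonelist:
            ctrl = max(free, key=free.get)  # assumption: spread evenly in a zone
            if free[ctrl] > 0:
                free[ctrl] -= 1
                used[ctrl] = used.get(ctrl, 0) + 1
                break
        else:
            raise MemoryError("no zone has free memory left")
    return used


def without_zone_movable():
    # 1GB per controller: low controller in ZONE_DMA, high in ZONE_NORMAL.
    return [("NORMAL", {"high": 1024}), ("DMA", {"low": 1024})]


def with_designated_blocks():
    # Hypothetical split: 256MB of ZONE_MOVABLE carved out of each controller.
    return [
        ("MOVABLE", {"low": 256, "high": 256}),
        ("NORMAL", {"high": 768}),
        ("DMA", {"low": 768}),
    ]


before = allocate(without_zone_movable(), 400)
after = allocate(with_designated_blocks(), 400)
print(before)
print(after)
```

In the first layout, every one of the 400 units lands on the high controller, because ZONE_DMA is only a fallback. With movable blocks on both controllers, the same load splits evenly until ZONE_MOVABLE runs out, which is the interleaving Mel acknowledged.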
But the discussion got a bit technical and a bit heated – at one point David Hildenbrand remarked, "Adding feature A because people don't want to (! whoever the 'people' are) enable feature B? I hope I don't have to tell you what I think about statements like this :)" To which Florian Fainelli replied:
"It is not just that NUMA is not wanted, it is also not a great fit, the ARM CPU cluster and most peripherals that Linux cares about do have an uniform memory access to the available DRAM controllers/DRAM chips.
"Only a subset of the peripherals, especially the real-time and high bandwidth ones like video decoders and display[s] that may not be uniformly accessing DRAM. This stems from the fact that the memory controller(s) on the System-on-Chip we work with have a star topology and they schedule the accesses of each DRAM client (CPU, GPU, video decoder, display, Ethernet, PCIe, etc) differently in order to guarantee a certain quality of service."
There was no resolution during the discussion. It seems that the advocates of Designated Movable Blocks feel that this feature is simply an extension of existing behaviors, some of which were coded by the very people criticizing those behaviors now. On the other hand, the critics of Doug's patch feel that the underlying thing Doug wants to enhance should really go away entirely, or at least should be redesigned to use the better and more generic NUMA approach.
The interesting thing for me about this whole conversation is that it is fundamentally a debate between the better design (NUMA) versus the current reality (UMA). If I think about how Linus Torvalds comes down on these issues, it's not clear at all that he favors one over the other. In many cases, yes, he'll insist on the better design, even if it means heaping a ton of work on the heads of people trying to implement one little feature. On the other hand, sometimes the "better design" is a large and abstract thing that has no actual users, while the current reality is something clean and easy to fix or implement, and Linus will choose to go with the current reality.
It's unclear whether Doug's patch would ever come to Linus's attention – it seems that Mel is the gatekeeper in this particular case, having written the original code. But it doesn't seem like a clear-cut decision, at least as far as this particular email discussion went.