Zack's Kernel News

Article from Issue 228/2019

Author(s): Zack Brown

Creating libperf; Using GCC Extensions; Editing the Laws of the Universe.

Creating libperf

Jiri Olsa recently posted a large patchset to begin the process of migrating the perf profiling code out of the core kernel code and into its own libperf library. perf is a debugging tool that is virtually never encountered by regular users. It profiles parts of the kernel in order to identify bottlenecks and other slowdowns. This lets the developers know which areas of the kernel might offer a big reward for receiving their attention.

Converting perf to a library is certainly a large task, so Jiri's code was just an initial pass at creating a library infrastructure (based largely on the existing libbpf code), onto which more and more could be migrated over time. Initially, the code was limited to basic counting operations, such as tallying up the number of CPUs and threads or enabling and disabling events. In the perf world, an "event" is a trigger point that allows perf to do various actions in the midst of a piece of kernel code that has no idea it's being profiled.

The problem with creating a full libperf library to entirely replace perf in one fell swoop is that it's too prone to errors. Jiri said the amount of code that would need to be rewritten was truly vast, so the likelihood of creating lots and lots of bugs was pretty high. Creating the stable basic infrastructure first, and then migrating the various pieces in a relatively straightforward progression, would avoid that problem and would make the bug-hunting process much cleaner for everyone.

Jiri's plans for the future were to start adding ways to actually use the collected data and event handling that had been part of the initial pass. Eventually the code would also migrate away from the perf default directory into a new directory more appropriate for a public library.

Ian Rogers from Google was thrilled with this whole project. His team had been working on methods of converting perf data to other forms, to avoid overhead – presumably because they were running perf on production systems that needed to be sleek and fast yet still report problems to the sysadmin teams.

Some of Ian's concerns were interface cleanliness and compatibility with C++, which was the language his team mostly used. Jiri replied that with a little more work, libperf would soon offer some higher-level interfaces that Ian might find simple and useful for his needs.

Song Liu also replied to Jiri's initial post, pointing out that the event code in Jiri's initial patchset was really an abstraction rather than a full interface. He added that there were a lot of tools that currently used perf that didn't rely on those abstractions. He said, "I am not sure whether these tools would adopt libperf, as libperf changes their existing concepts/abstractions."

Jiri replied that libperf would eventually include the rest of the perf API, so those other tools would not be forced to adopt abstractions that didn't match their use cases.

Arnaldo Carvalho de Melo also replied to Song, reiterating Jiri's point, saying, "for now, we're just trying to have something that is not so tied to perf and could possibly be useful outside tools/perf/ when the need arises for whatever new tool or preexisting one. There are features there that may be interesting to use outside perf; time will tell." Arnaldo added that Jiri "is just slowly moving things to a public libperf while keeping perf working; in the end, the goal is to have as much stuff that is not super specific to some of the existing perf tools (tools/perf/builtin-*.c) in libperf as possible. It is still early in this effort; that is why he is still leaving it in tools/perf/lib/ and not in tools/lib/perf/."

Song replied that he liked this strategy and admired the amount of work it would take.

Arnaldo also replied to Jiri's initial patchset, saying, "I've tested it in various distros and made fixes in the relevant csets to avoid breaking bisection; it builds everywhere I tested so far, except on Fedora Rawhide, but that is something unrelated, a coincidence since I refreshed that container yesterday (one Python hiccup and something else); I've made some changes to the docs adding some articles and adding some clarification about refcounts not necessarily destroying the object, just dropping a reference, pushed everything to tmp.perf/core, and will do the whole container testing soon."
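
Arnaldo's remark about refcounts describes a common C pattern: a "put" call drops one reference, and the object is destroyed only when the last reference goes away. The sketch below is a toy illustration of that pattern, not libperf code. The toy_obj type and its functions are hypothetical names, and the counter is a plain int where real code would typically use an atomic type.

```c
#include <stdlib.h>

/* Hypothetical reference-counted object (not a libperf type). */
struct toy_obj {
    int refcnt;      /* number of live references */
    int *destroyed;  /* optional flag set when the object is freed */
};

/* Allocate an object holding one reference for the caller. */
struct toy_obj *toy_obj__new(int *destroyed)
{
    struct toy_obj *obj = malloc(sizeof(*obj));

    if (obj == NULL)
        return NULL;
    obj->refcnt = 1;
    obj->destroyed = destroyed;
    return obj;
}

/* Take an additional reference. */
struct toy_obj *toy_obj__get(struct toy_obj *obj)
{
    if (obj != NULL)
        obj->refcnt++;
    return obj;
}

/* Drop one reference; free the object only when none remain. */
void toy_obj__put(struct toy_obj *obj)
{
    if (obj == NULL)
        return;
    if (--obj->refcnt == 0) {
        if (obj->destroyed != NULL)
            *obj->destroyed = 1;
        free(obj);
    }
}
```

A caller that does a new, a get, and then a put still holds a valid object; only the second put actually frees it.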

And Alexey Budankov from Intel offered his approval of the whole patchset as well. He remarked, "Some API for reading perf record trace could be valuable extensions for the library. Also at some point public API will, probably, need some versioning."

There was no reply to that, and the conversation came to an end. Personally I love seeing the progression of these various tools. There are so many of them! Something is needed for the Linux kernel. Someone creates it (or more often, a lot of people try to create it, and something coherent gradually emerges). Eventually, someone sees that it could have a wider value than just for the Linux kernel, and, at some point, it is abstracted out of the core kernel code and made into a standalone thing that helps people throughout the world.

Using GCC Extensions

Interactions between the kernel source tree and the GCC compiler are almost always strange. Their incestuous intertwinings and rivalrous collaborations have forced their respective developers to deal with many harsh truths, concerning which favorite feature is truly to be determined by one project or the other.

In this particular case, Joe Perches wanted the Linux kernel to support GCC's fallthrough attribute, introduced in GCC v7. GCC attributes are special hints that can appear directly in your code, but instead of being part of the C or C++ language, they are interpreted specially by GCC, to help it produce the absolute best possible machine code in your compiled binary.

The fallthrough attribute, for example, is used in switch statements to tell GCC that execution is intentionally allowed to fall through from the end of one "case" into the next, rather than leaving the switch with a break. Without that hint, GCC can warn that a break might have been forgotten.
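
As a rough illustration in ordinary C (not kernel code), the sketch below uses the statement attribute available since GCC 7. The function name and the privilege levels are made up for the example; building with -Wimplicit-fallthrough should stay quiet because each fall through is marked as intentional.

```c
/*
 * Hypothetical example: count how many privileges a user level grants.
 * Higher levels deliberately fall through into the cases below them.
 */
int privileges_for_level(int level)
{
    int count = 0;

    switch (level) {
    case 2:
        count++;                        /* admin-only privilege */
        __attribute__((fallthrough));   /* intentional: continue into case 1 */
    case 1:
        count++;                        /* regular-user privilege */
        __attribute__((fallthrough));   /* intentional: continue into case 0 */
    case 0:
        count++;                        /* guest privilege */
        break;
    default:
        count = -1;                     /* unknown level */
        break;
    }
    return count;
}
```

Removing either attribute line leaves the behavior unchanged, but gives the compiler a reason to warn that a break may be missing.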

The justification is clear – GCC doesn't want to extend the C language itself, because these hints don't add new behavior to a program. They describe what the programmer already intended, so the compiler can warn about likely mistakes and avoid generating machine code it doesn't need. So, instead of extending the language directly, GCC allows users to insert these non-C-language hints into perfectly good C code.

Joe posted patches to update the kernel source tree, to reserve all 4,200 occurrences of the word "fallthrough" as a "pseudo keyword," so it would only appear in the source tree in places where it was actually intended to be used as a GCC attribute. The kicker is that GCC also accepts this particular hint not as an attribute inserted directly into the code, but as a standard C comment, bounded by /* and */.

Peter Zijlstra burst into applause. And Pavel Machek immediately approved the patch for inclusion in the kernel. Pavel also asked if the "fallthrough" C comment would also be recognized by GCC if it appeared in a macro; Joe replied that GCC would, but the non-GCC code checkers and other tools probably would not.

Kees Cook was also concerned that he did not want to break existing code scanners, especially the Coverity scanner. He said, "I'd like to make sure we don't regress Coverity most of all. If the recent updates to the Coverity scanner include support for the attribute now, then I'm all for it." But to this, Peter replied, "Coverity can go pound sand; I never see its output, while I get to look at the code and GCC output daily."

Miguel Ojeda also pointed out that this entire topic had been raised in the past, when the conclusion was that it would be better to wait until the ecosystem of surrounding tools supported the attribute and wouldn't be broken by the change. He asked, "Is everyone happy this time around?"

Joe replied to Miguel, saying that in fact, his patches didn't actually change the kernel to use the fallthrough attribute – they only reserved the word "fallthrough." So, he said, "Patches that convert /* fallthrough */ et al. to fallthrough would only be published when everyone's happy enough."
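
To make the "pseudo keyword" idea concrete, the sketch below shows one way a macro named fallthrough could expand to the GCC attribute, with a harmless empty statement as a fallback for compilers that lack it. This is an illustration under assumptions rather than a copy of Joe's patches; cleanup_steps_from() is a hypothetical function invented for the example.

```c
/* Older compilers may not provide __has_attribute; treat it as "no". */
#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

/* The pseudo keyword: an attribute where supported, a no-op otherwise. */
#if __has_attribute(__fallthrough__)
# define fallthrough __attribute__((__fallthrough__))
#else
# define fallthrough do {} while (0)
#endif

/* Hypothetical example: how many cleanup steps remain from a given stage. */
int cleanup_steps_from(int stage)
{
    int steps = 0;

    switch (stage) {
    case 3:
        steps++;        /* undo stage 3 */
        fallthrough;
    case 2:
        steps++;        /* undo stage 2 */
        fallthrough;
    case 1:
        steps++;        /* undo stage 1 */
        break;
    default:
        break;          /* nothing to undo */
    }
    return steps;
}
```

Because the preprocessor discards comments before macros are expanded, a #define like this can produce an attribute but never a /* fallthrough */ comment.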

Meanwhile, H. Peter Anvin asked what would happen if someone ran the kernel source tree through a comment stripper before compiling – specifically, could the comments be replaced with some other "magic token" to accomplish the same thing. This led to a technical discussion surrounding exactly how these attributes should be represented in the kernel, to work best with the various existing tools. Should it be a comment? Should the word "fallthrough" have underscores attached at either end? And so on. At one point Joe asked Linus Torvalds to make a determination, and Linus replied:

"My only real concern is that the comment approach has always been the really traditional one, going back all the way to lint days.

"And you obviously cannot use a #define to create a comment, so this whole keyword model will never be able to do that.

"At the same time, all the modern tools we care about do seem to be happy with it, either through the GCC attribute, the Clang [[clang:fallthrough]], or the (eventual) standard C [[fallthrough]] model.

"So I'm ok with just saying 'the comment model may be traditional, but it's not very good'."

The discussion continued a short while, essentially with implementation details. The real issue at the root of GCC/kernel interactions is that each project sees itself as more fundamental than the other. The GCC developers feel GCC is more fundamental, because it is responsible for building all software in the solar system, not just Linux; while the kernel developers feel the kernel is more fundamental, because it is responsible for running all hardware in the solar system. The GCC people aren't going to love the idea of implementing a feature just to make the kernel developers' lives easier; while the kernel people aren't going to love the idea of having to rely on GCC features that dictate how the final machine code will end up.

In the past, this debate led Linus to rely on an older and "better" version of GCC for a very long time, even as the GCC code continued to grow and develop. In some ways, it was the debate to end all debates, and now the GCC developers and kernel developers keep in better contact and have a friendlier relationship.

Editing the Laws of the Universe

Lately, the Linux kernel developers have been rewriting the core scheduler code. I repeat: the core scheduler code. This is the part of the kernel that decides how and when to switch between running processes. Process switching generally happens so rapidly that you can have tons of users all logged into your system at the same time, and all have a smooth, pleasant experience. It's what makes our computers "multitasking" instead of "single-tasking."

It's also notoriously difficult to test. How can you tell if one core scheduler implementation is better than another? Or for that matter, how can you tell if a single patch improves an existing core scheduler or makes it choppier or slower? The core scheduler is supposed to work well on billions of computers, including virtual systems, running on every conceivable hardware configuration, for users engaged in any conceivable set of use cases – not just porn.

In fact, there's really no way to perform exhaustive and correct tests. Developers working in the area of the core scheduler just basically … do their best. They think really hard. They invoke the muses. And they do their best to generate convincing explanations that fit into simple paragraphs in a changelog entry.

Vineeth Remanan Pillai (on behalf of many co-developers) recently posted the latest iteration of patches to rewrite the core scheduler. In this iteration, the code was mostly concerned with getting the basic ideas right, avoiding crashes, and running fast. Among other requirements, virtual systems and CPU hot plugging took center stage.

Aubrey Li posted with some test results, including problems with the tests themselves in this new version of the scheduler; Julien Desfossez and Aaron Lu immediately jumped in to help debug the tests.

At one point, Julien offered his assessment of these latest patches from Vineeth, saying, "it helps for untagged interactive tasks and fairness in general, but this increases the overhead of core scheduling when there is contention for the CPU with tasks of varying CPU usage. The general trend we see is that if there is a CPU-intensive thread and multiple relatively idle threads in different tags, the CPU-intensive tasks continuously yield to be fair to the relatively idle threads when it becomes runnable. And if the relatively idle threads make up for most of the tasks in a system and are tagged, the CPU-intensive tasks see a considerable drop in performance."

Once the problem with Aubrey's testing scripts had been found, Aubrey posted some new benchmarks, explaining that at his job, "The story [that] we care about latency is that some customers reported their latency critical job is affected when co-locating a deep learning job (AVX-512 task) onto the same core, because when a core executes AVX-512 instructions, the core automatically reduces its frequency. This can lead to a significant overall performance loss for a non-AVX-512 job on the same core."

With the new patches, Aubrey reported improved results in these tests.

Meanwhile, Julien, from the dark depths, remarked, "After reading more traces and trying to understand why only untagged tasks are starving when there are CPU-intensive tasks running on the same set of CPUs, we noticed a difference in behavior in pick_task. In the case where core_cookie is 0, we are supposed to only prefer the tagged task if its priority is higher, but when the priorities are equal we prefer it as well, which causes the starving. pick_task is biased toward selecting its first parameter in case of equality, which in this case was the class_pick instead of max. Reversing the order of the parameter solves this issue and matches the expected behavior."
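
The tie-breaking bias Julien describes is easy to reproduce in miniature. The sketch below is a toy model, not the scheduler's pick_task: toy_task and toy_pick are hypothetical names, and the assumption that a larger number means a higher priority is made only for this example.

```c
#include <stddef.h>

/* Toy stand-in for a runnable task (not the kernel's task structure). */
struct toy_task {
    const char *name;
    int prio;               /* assumption: larger value = higher priority */
    unsigned long cookie;   /* 0 means untagged */
};

/*
 * Return the higher-priority task. On a tie, the FIRST argument wins,
 * which mirrors the bias described in the quoted message.
 */
const struct toy_task *toy_pick(const struct toy_task *a,
                                const struct toy_task *b)
{
    if (a == NULL)
        return b;
    if (b == NULL)
        return a;
    /* b is chosen only when strictly better; equality favors a. */
    return (b->prio > a->prio) ? b : a;
}
```

With equal priorities, toy_pick(&tagged, &untagged) returns the tagged task, while toy_pick(&untagged, &tagged) returns the untagged one, so the order of the arguments alone decides who runs.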

Subhra Mazumdar also posted some benchmark tests, showing good results for database use cases. Subhra suggested that the particular configuration that produced this result should be made standard by default in the core scheduler. Julien replied that yes, this was not even going to be optional, but would just be the standard behavior in later versions of the scheduler.

Aubrey also reported some new benchmarks, in which a set of virtual systems appeared to suffer from unfair scheduling under certain circumstances. Unfairness refers to some processes getting more CPU time than others, or some CPUs being used more than others. Julien confirmed that this was a reproducible case and began looking for what might have caused it. Tim Chen was also interested in tracking this down and started hacking at the code to see what could be discovered.

This debugging session continued, with more tests and patches flying around, and the developers just generally having the time of their lives. More developers joined the fun, such as Dario Faggioli from SUSE, and it was all magic.

Nothing whatsoever was decided, accepted, rejected, or anything like that. What we had here was a bunch of people not caring in the slightest whether the gaze of history was on them and just playing the music of the universe with nothing but love. It was wonderful just to get to listen.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
